Table of Contents
Time Series Forecasting using Deep Learning¶
Pre-order my new book: Time Series with PyTorch: Modern Deep Learning Toolkit for Real-World Forecasting Challenges.
Forecasting the future is an extremely valuable superpower. The forecasting game has long been dominated by statisticians, who are real experts in time series problems. As the amount of data grows, however, many statistical methods can no longer squeeze enough out of massive datasets. Consequently, time series forecasting using deep learning emerged and became a fast-growing field. It is trendy, not only in LinkedIn debates but also in academic papers. We plotted the number of related publications per year using the keyword "deep learning forecasting" on dimensions.ai2.
On the other hand, deep learning methods are not yet winning all the games of forecasting. Time series forecasting is a complicated problem with a great variety of data generating processes (DGP). Some argue that we don't need deep learning to forecast, since well-tuned statistical models and trees already perform well and are faster and more interpretable than deep neural networks3 4. Ensembles of statistical models perform great, even outperforming many deep learning models on the M3 data1.
However, deep learning models are picking up speed. In the M5 competition, deep learning methods "have shown forecasting potential, motivating further research in this direction"5. As the complexity and size of time series data grow and more and more deep learning forecasting models are being developed, forecasting with deep learning is on the path to becoming an important alternative to statistical forecasting methods.
In Coding Tips, we provide coding tips to help some readers set up the development environment. In Deep Learning Fundamentals, we introduce the fundamentals of deep neural networks and their practices. For completeness, we also provide code and derivations for the models. Building on these two parts, we introduce time series data and statistical forecasting models in Time Series Forecasting Fundamentals, where we discuss methods for analyzing time series data, several universal data generating processes of time series data, and some statistical forecasting methods. Finally, we fulfill our promise in the title in Time Series Forecasting with Deep Learning.
Blueprint¶
The following is my first version of the blueprint.
- Engineering Tips
- Environment, VSCode, Git, ...
- Python Project Tips
- Fundamentals of Time Series Forecasting
- Time Series Data and Terminologies
- Transformation of Time Series
- Two-way Fixed Effects
- Time Delayed Embedding
- Data Generating Process (DGP)
- DGP: Langevin Equation
- Kindergarten Models for Time Series Forecasting
- Statistical Models
- Statistical Model: AR
- Statistical Model: VAR
- Synthetic Datasets
- Synthetic Time Series
- Creating Synthetic Dataset
- Data Augmentation
- Forecasting
- Time Series Forecasting Tasks
- Naive Forecasts
- Evaluation and Metrics
- Time Series Forecasting Evaluation
- Time Series Forecasting Metrics
- CRPS
- Hierarchical Time Series
- Hierarchical Time Series Data
- Hierarchical Time Series Reconciliation
- Some Useful Datasets
- Trees
- Tree-based Models
- Random Forest
- Gradient Boosted Trees
- Forecasting with Trees
- Fundamentals of Deep Learning
- Deep Learning Introduction
- Learning from Data
- Neural Networks
- Recurrent Neural Networks
- Convolutional Neural Networks
- Transformers
- Dynamical Systems
- Why Dynamical Systems
- Neural ODE
- Energy-based Models
- Diffusion Models
- Generative Models
- Autoregressive Model
- Auto-Encoder
- Variational Auto-Encoder
- Flow
- Generative Adversarial Network (GAN)
- Time Series Forecasting with Deep Learning
- A Few Datasets
- Forecasting with MLP
- Forecasting with RNN
- Forecasting with Transformers
- TFT
- DLinear
- NLinear
- Forecasting with CNN
- Forecasting with VAE
- Forecasting with Flow
- Forecasting with GAN
- Forecasting with Neural ODE
- Forecasting with Diffusion Models
- Extras Topics, Supplementary Concepts, and Code
- DTW and DBA
- f-GAN
- Info-GAN
- Spatial-temporal Models, e.g., GNN
- Conformal Prediction
- Graph Neural Networks
- Spiking Neural Networks
- Deep Infomax
- Contrastive Predictive Coding
- MADE
- MAF
- ...
- Nixtla. statsforecast/experiments/m3 at main · Nixtla/statsforecast. In: GitHub [Internet]. [cited 12 Dec 2022]. Available: https://github.com/Nixtla/statsforecast/tree/main/experiments/m3 ↩
- Hook DW, Porter SJ, Herzog C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics 2018; 3: 23. ↩
- Elsayed S, Thyssens D, Rashed A, Jomaa HS, Schmidt-Thieme L. Do we really need deep learning models for time series forecasting? 2021. doi:10.48550/ARXIV.2101.02118. ↩
- Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on tabular data? 2022. doi:10.48550/ARXIV.2207.08815. ↩
- Makridakis S, Spiliotis E, Assimakopoulos V. M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting 2022; 38: 1346–1364. ↩
Engineering Tips
Coding Tips¶
In this book, we use Python as our programming language. In the main chapters, we will focus on the theories and actual code and skip the basic concepts. To make sure we are on the same page, we shove all the tech stack related topics into this chapter for future reference. It is not necessary to read this chapter before reading the main chapters. However, we recommend the readers go through this chapter at some point to make sure they are not missing some basic engineering concepts.
Info
This chapter is not aiming to be a comprehensive note on these technologies but a few key components that may be missing in many research-oriented tech stacks. We assume the readers have worked with the essential technologies in a Python-based deep learning project.
Good References for Coding in Research¶
Some skills take only a little time to learn, yet people benefit from them for their whole lives. For programmers, managing code falls exactly into this bucket.
The Good Research Code Handbook is a very good and concise guide to building good coding habits. This should be a good first read.
The Alan Turing Institute also has a Research Software Engineering with Python course. This is a comprehensive generic course for boosting Python coding skills in research.
A Tech Stack Checklist¶
We provide a concise list of tools for coding. Most of them are probably already integrated into most people's workflows. Hence we provide no descriptions but only the list itself.
In the following diagrams, we highlight the recommended tools in orange. Clicking on them takes us to the corresponding websites.
The first set of checklists is to help us set up a good coding environment.
flowchart TD
classDef highlight fill:#f96;
env["Setting up Coding Environment"]
git["fa:fa-star Git"]:::highlight
precommit["pre-commit"]:::highlight
ide["Integrated Development Environment (IDE)"]
vscode["Visual Studio Code"]:::highlight
pycharm["PyCharm"]
jupyter["Jupyter Notebooks"]
python["Python Environment"]
py_env["Python Environment Management"]
conda["Anaconda"]
pyenv_venv["Pyenv + venv + pip"]
pyenv_poetry["Pyenv + poetry"]
poetry["Poetry"]:::highlight
pyenv["pyenv"]:::highlight
venv["venv"]
click git "https://git-scm.com/" "Git"
click precommit "https://pre-commit.com/" "pre-commit"
click vscode "https://code.visualstudio.com/" "Visual Studio Code"
click jupyter "https://jupyter.org/" "Jupyter Lab"
click pycharm "https://www.jetbrains.com/pycharm/" "PyCharm"
click conda "https://www.anaconda.com/" "Anaconda"
click pyenv "https://github.com/pyenv/pyenv" "pyenv"
click venv "https://docs.python.org/3/library/venv.html" "venv"
click poetry "https://python-poetry.org/" "poetry"
env --- git
git --- precommit
env --- ide
ide --- vscode
ide --- jupyter
ide --- pycharm
env --- python
python --- py_env
py_env --- conda
py_env --- pyenv_venv
py_env --- pyenv_poetry
pyenv_venv --- pyenv
pyenv_venv --- venv
pyenv_poetry --- pyenv
pyenv_poetry --- poetry
The second set of checklists is to boost our code quality.
flowchart TD
classDef highlight fill:#f96;
python["Python Code Quality"]
test["Test Your Code"]
formatter["Formatter"]
linter["Linter"]
pytest["pytest"]:::highlight
black["black"]:::highlight
isort["isort"]:::highlight
pylint["pylint"]
flake8["flake8"]
pylama["pylama"]
mypy["mypy"]:::highlight
click pytest "https://pytest.org/" "pytest"
click black "https://github.com/psf/black" "black"
click isort "https://github.com/pycqa/isort"
click mypy "http://mypy-lang.org/"
click pylint "https://pylint.pycqa.org/"
click flake8 "https://flake8.pycqa.org/en/latest/"
click pylama "https://github.com/klen/pylama"
python --- test
test --- pytest
python --- formatter
formatter --- black
formatter --- isort
python --- linter
linter --- mypy
linter --- pylint
linter --- flake8
linter --- pylama
Finally, we also mention the primary Python packages used here.
flowchart TD
classDef highlight fill:#f96;
dataml["Data and Machine Learning"]
pandas["Pandas"]:::highlight
pytorch["PyTorch"]:::highlight
lightning["PyTorch Lightning"]:::highlight
much_more["and more ..."]
click pandas "https://pandas.pydata.org/"
click pytorch "https://pytorch.org/"
click lightning "https://www.pytorchlightning.ai/"
dataml --- pandas
dataml --- pytorch
dataml --- lightning
dataml --- much_more
Python¶
We assume the readers have a good understanding of the Python programming language, as Python will be the primary programming language for demos and tutorials in this book. For engineering tips, we will cover a few topics here, including
- Environment management;
- Dependency management;
- pre-commit.
TL;DR
- Use pyenv to manage Python versions;
- Use poetry to manage dependencies;
- Always set up `pre-commit` for your git repository.
Python Environment Management¶
Environment management is never easy, and the same is true for the Python ecosystem. People have developed a lot of tools to make environment management easier. As you could imagine, this also means we have a zoo of tools to choose from.
There are three things to manage in a Python project:
- Python version,
- Dependencies of the project, and
- An environment where we install our dependencies.
Some tools can manage all three, and some tools focus on one or two of them. We discuss two popular sets of tools: conda and pyenv + poetry.
conda¶
Many data scientists started with the simple and out-of-the-box choice called conda. conda is an all-in-one toolkit to manage Python versions, environments, and project dependencies.
conda cheatsheet
The most useful commands for conda are the following.
- Create an environment: `conda create -n my-env-name python=3.9 pip`, where `my-env-name` is the name of the environment, `python=3.9` specifies the Python version, and `pip` at the end tells `conda` to install `pip` in this new environment.
- Activate an environment: `conda activate my-env-name`
- Install a new dependency: `conda install pandas`
- List all available environments: `conda env list`
Anaconda provides a nice cheatsheet.
pyenv + poetry¶
conda is powerful, but it is too powerful for a simple Python project. As of 2024, if you ask around, many Python developers will recommend poetry.
poetry manages dependencies and environments. We just need a tool like pyenv to manage Python versions.
The poetry workflow
To work with poetry in an existing project my_kuhl_project:

- `poetry init` to initialize the project, then follow the instructions;
- `poetry env use 3.10` to specify the Python version. In this example, we use `3.10`;
- `poetry add pandas` to add a package called `pandas`.
Everything we specified will be written into the pyproject.toml file.
poetry provides a nice tutorial on its website.
Dependency Specifications¶
We have a few choices to specify the dependencies. The most used method at the moment is requirements.txt. However, specifying dependencies in pyproject.toml is a much better choice.
Python introduced pyproject.toml in PEP 518, and it can be used together with poetry to manage dependencies.
While tutorials on how to use poetry are beyond the scope of this book, we highly recommend using poetry in a formal project.
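For reference, a minimal pyproject.toml managed by poetry might look like the following sketch; the project name, author, and version pins are illustrative, not prescriptive.

```toml
[tool.poetry]
name = "my-kuhl-project"
version = "0.1.0"
description = "A time series forecasting project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Running `poetry add` updates the `[tool.poetry.dependencies]` table (and the lock file) automatically, so this file rarely needs to be edited by hand.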
poetry is sometimes slow
poetry can be very slow, as in some cases it has to download many different versions of a package to try out56.
conda with pip
If one insists on using conda, here we provide a few tips for conda users.
conda provides its own requirement specification using environment.yaml. However, many projects still prefer requirements.txt even though conda's environment.yaml is quite powerful.
To use requirements.txt and pip, we always install pip when creating a new environment, e.g., conda create -n my-env-name python=3.9 pip.
Once the environment is activated (conda activate my-env-name), we can use pip to install dependencies, e.g., pip install -r requirements.txt.
Python Styles and pre-commit¶
In a Python project, it is important to have certain conventions or styles. To be consistent, one can follow a style guide for Python. There are official proposals, such as PEP 83, and "third-party" style guides, such as the Google Python Style Guide4.
We also recommend pre-commit. pre-commit helps us manage git hooks to be executed before each commit. Once installed, every time we run git commit -m "my commit message here", a series of commands will be executed first based on the configurations.
pre-commit officially provides some hooks already, e.g., trailing-whitespace 2.
We also recommend the following hooks,
- black, which formats the code based on pre-defined styles;
- isort, which orders the Python imports1;
- mypy, which performs static type checking for Python.
The following is an example .pre-commit-config.yaml file for a Python project.
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.2.0
hooks:
- id: check-added-large-files
- id: debug-statements
- id: detect-private-key
- id: end-of-file-fixer
- id: requirements-txt-fixer
- id: trailing-whitespace
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.960
hooks:
- id: mypy
args:
- "--no-strict-optional"
- "--ignore-missing-imports"
- repo: https://github.com/ambv/black
rev: 22.6.0
hooks:
- id: black
language: python
args:
- "--line-length=120"
- repo: https://github.com/pycqa/isort
rev: 5.10.1
hooks:
- id: isort
name: isort (python)
args: ["--profile", "black"]
Write docstrings¶
Writing docstrings for functions and classes can help our future self understand them more easily. There are different styles for docstrings. Two of the popular ones are the Google style and the NumPy style.
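For instance, a Google-style docstring might look like the following sketch (the function itself is purely illustrative):

```python
def moving_average(series: list[float], window: int) -> list[float]:
    """Compute the simple moving average of a series.

    Args:
        series: The input time series values.
        window: The number of steps to average over.

    Returns:
        A list of length ``len(series) - window + 1`` containing the averages.
    """
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]
```

The `Args`/`Returns` sections are what docstring tooling (and colleagues) expect to find when reading Google-style code.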
Test Saves Time¶
Adding tests to our code can save us time. We will not list all the benefits of having tests, but tests help us debug our code and ship results more confidently. For example, suppose we are developing a function and spot a bug. One of the best ways to debug it is to write a test and put a debugger breakpoint at the suspicious line of code. With the help of IDEs such as Visual Studio Code, this process can save us a lot of time in debugging.
Use pytest
Use pytest. RealPython provides a good short introduction. The Alan Turing Institute provides some lectures on testing and pytest.
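As a minimal illustration, a pytest test is just a `test_`-prefixed function containing plain `assert` statements; the metric and file name below are hypothetical, not from this book's codebase:

```python
# test_metrics.py -- pytest discovers files and functions prefixed with "test_"
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def test_mean_absolute_error():
    # Plain asserts are enough: pytest rewrites them to show helpful diffs.
    assert mean_absolute_error([1, 2, 3], [1, 2, 3]) == 0
    assert mean_absolute_error([1, 2], [2, 3]) == 1
```

Running `pytest` in the project root collects and runs such tests automatically; a breakpoint set inside `test_mean_absolute_error` drops us straight into the suspicious code path.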
- Pre Commit. In: isort [Internet]. [cited 22 Jul 2022]. Available: https://pycqa.github.io/isort/docs/configuration/pre-commit.html ↩
- pre-commit-config-pre-commit-hooks.yaml. In: Gist [Internet]. [cited 22 Jul 2022]. Available: https://gist.github.com/lynnkwong/f7591525cfc903ec592943e0f2a61ed9 ↩
- Guido van Rossum, Barry Warsaw, Nick Coghlan. PEP 8 – Style Guide for Python Code. In: peps.python.org [Internet]. 5 Jul 2001 [cited 23 Jul 2022]. Available: https://peps.python.org/pep-0008/ ↩
- Google Python Style Guide. In: Google Python Style Guide [Internet]. [cited 22 Jul 2022]. Available: https://google.github.io/styleguide/pyguide.html ↩
- Poetry is extremely slow when resolving the dependencies · Issue #2094 · python-poetry/poetry. In: GitHub [Internet]. [cited 23 Jul 2022]. Available: https://github.com/python-poetry/poetry/issues/2094 ↩
- FAQ. In: Poetry - Python dependency management and packaging made easy [Internet]. [cited 29 Jan 2024]. Available: https://python-poetry.org/docs/faq/#why-is-the-dependency-resolution-process-slow ↩
Ended: Engineering Tips
Fundamentals of Time Series Forecasting
Time Series Data and Statistical Forecasting Methods¶
Time Series Data¶
Time series data comes from a variety of data generating processes. There are also different formulations and views of time series data.
Time series data can be formulated as a sequence of vector functions of time 1. There are many different types of tasks on time series data, for example,
- classification,
- anomaly detection, and
- forecasting.
In this chapter, we focus on the forecasting problem.
The Forecasting Problem¶
To make it easier to formulate the forecasting problem, we group the time series features based on the role they play in a forecasting problem. Given a dataset \(\mathcal D\), with
- \(y^{(i)}_t\), the sequential variable to be forecasted,
- \(x^{(i)}_t\), exogenous data for the time series data,
- \(u^{(i)}_t\), some features that can be obtained or planned in advance,
where \({}^{(i)}\) indicates the \(i\)th variable and \({}_ t\) denotes time. In a forecasting task, we use \(y^{(i)} _ {t-K:t}\), \(x^{(i)} _ {t-K:t}\), and \(u^{(i)} _ {t-K:t+H}\) to forecast the future \(y^{(i)} _ {t+1:t+H}\). In these notations, \(K\) is the input sequence length and \(H\) is the forecast horizon.

A forecasting model \(f\) will use \(x^{(i)} _ {t-K:t}\) and \(u^{(i)} _ {t-K:t+H}\) to forecast \(y^{(i)} _ {t+1:t+H}\).
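To make the notation concrete, here is a minimal sketch (the helper `make_windows` is ours, not a library API) that slices a univariate series into \((y_{t-K:t}, y_{t+1:t+H})\) input/target pairs:

```python
def make_windows(y, K, H):
    """Split a univariate series into (input, target) pairs.

    Each input has length K (the history) and each target has
    length H (the forecast horizon).
    """
    pairs = []
    # t indexes the first step *after* the history window
    for t in range(K, len(y) - H + 1):
        pairs.append((y[t - K : t], y[t : t + H]))
    return pairs


pairs = make_windows(list(range(10)), K=3, H=2)
# first pair: history [0, 1, 2], target [3, 4]
```

Multivariate inputs \(x\) and known-in-advance features \(u\) are windowed the same way, with \(u\) extending \(H\) steps into the future.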
In the section Time Series Forecasting Tasks, we will discuss more details of the forecasting problem.
Categories of Forecasting Methods¶
Januschowski et al. proposed a framework to classify the different forecasting methods2. We illustrate the different methods in the following chart. For simplicity, we merge all the possible dimensions into one chart.
flowchart LR
classDef subjective fill:#EE8866;
classDef objective fill:#77AADD;
dimensions["Dimensions of Forecasting Methods"]
%% Objective
params_shared["Parameter Shared Across Series"]:::objective
params_shared --"True"-->Global:::objective
params_shared --"False"-->Local:::objective
uncertainty["Uncertainty in Forecasts"]:::objective
uncertainty --"True"--> Probabilistic["Probabilistic Forecasts:\n forecasts with predictive uncertainty"]:::objective
uncertainty --"False"--> Point["Point Forecasts"]:::objective
computational_complexity["Computational Complexity"]:::objective
linear_convexity["Linearity and Convexity"]:::objective
dimensions --> params_shared
dimensions --> uncertainty
dimensions --> computational_complexity
dimensions --> linear_convexity
%% Subjective
structural_assumptions["Strong Structural Assumption"]:::subjective --"Yes"--> model_driven["Model-Driven"]:::subjective
structural_assumptions --"No"--> data_driven["Data-Driven"]:::subjective
model_comb["Model Combinations"]:::subjective
discriminative_generative["Discriminative or Generative"]:::subjective
theoretical_guarantees["Theoretical Guarantees"]:::subjective
predictability_interpretability["Predictability and Interpretability"]:::subjective
dimensions --> structural_assumptions
dimensions --> model_comb
dimensions --> discriminative_generative
dimensions --> theoretical_guarantees
dimensions --> predictability_interpretability
We will mention those different dimensions later in our discussion of different forecasting models. For example, random forest is an ensemble method, which we will discuss in detail later.
- Dorffner G. Neural networks for time series processing. Neural Network World 1996; 6: 447–468. ↩
- Januschowski T, Gasthaus J, Wang Y, Salinas D, Flunkert V, Bohlke-Schneider M et al. Criteria for classifying forecasting methods. International Journal of Forecasting 2020; 36: 167–177. ↩
Time Series Data
Time Series Analysis¶
Time series analysis is not our focus here. However, it is beneficial to grasp some basic ideas of time series.
Stationarity¶
Time series data is stationary if the distribution of the observables does not change over time126.

A strictly stationary series guarantees the same distribution for a segment \(\{x_{i+1}, \cdots, x_{i+k}\}\) and a time-shifted segment \(\{x_{i+1+\Delta}, \cdots, x_{i+k+\Delta}\}\) for any integer \(\Delta\)1.

A less strict form, weak-sense stationarity (WSS), concerns only the mean and the autocorrelation13, i.e.,

$$\mathbb E[x_t] = \mu, \qquad \operatorname{Cov}(x_t, x_{t+\delta}) = \gamma(\delta),$$

where the mean is constant and the autocovariance depends only on the lag \(\delta\), not on \(t\).

In deep learning, many models require the training data to be i.i.d.47. The counterpart of the i.i.d. requirement for time series is stationarity.
A stationary time series is clean and pure. However, real-world data is not necessarily stationary, e.g., macroeconomic series data are non-stationary6.
Serial Dependence¶
Autocorrelation measures the serial dependence of a time series5. By definition, the autocorrelation is the autocovariance normalized by the variance,

$$\rho(\delta) = \frac{\operatorname{Cov}(x_t, x_{t+\delta})}{\operatorname{Var}(x_t)}.$$

One naive expectation is that the autocorrelation diminishes as \(\delta \to \infty\)3.
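As a small illustration of this decay, the following pure-Python sketch (the `autocorrelation` helper is ours, not a library function) estimates the sample autocorrelation of a simulated AR(1) series, whose theoretical autocorrelation at lag \(\delta\) is \(0.8^\delta\):

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation: autocovariance at ``lag`` normalized by the variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


# Simulate an AR(1) process: x_t = 0.8 * x_{t-1} + noise
random.seed(0)
x = [0.0]
for _ in range(5000):
    x.append(0.8 * x[-1] + random.gauss(0, 1))

print(autocorrelation(x, 1))   # close to 0.8
print(autocorrelation(x, 20))  # much closer to 0
```

The estimate at lag 1 sits near the AR coefficient, while at lag 20 it has decayed toward zero, as the naive expectation suggests.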
Terminology¶
Terminology for time series data differs across fields8. For example, in econometrics we encounter the term "panel data", which is the same as "multivariate time series" in data science.
Panel Data
Panel data is multivariate time series data,
| time | variable \(y_1\) | variable \(y_2\) | variable \(y_3\) |
|---|---|---|---|
| \(t_1\) | \(y_{11}\) | \(y_{21}\) | \(y_{31}\) |
| \(t_2\) | \(y_{12}\) | \(y_{22}\) | \(y_{32}\) |
| \(t_3\) | \(y_{13}\) | \(y_{23}\) | \(y_{33}\) |
| \(t_4\) | \(y_{14}\) | \(y_{24}\) | \(y_{34}\) |
| \(t_5\) | \(y_{15}\) | \(y_{25}\) | \(y_{35}\) |
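As a sketch (assuming pandas is available), the wide table above can be reshaped into the "long" layout that econometrics tooling often expects, with one row per (time, variable) pair; all names and values here are illustrative:

```python
import pandas as pd

# The wide layout: one column per variable, as in the table above
wide = pd.DataFrame({
    "time": ["t1", "t2", "t3"],
    "y1": [1.0, 2.0, 3.0],
    "y2": [4.0, 5.0, 6.0],
})

# The long layout: one row per (time, variable) observation
long = wide.melt(id_vars="time", var_name="variable", value_name="value")
```

Both layouts carry the same information; which one a library expects is purely a matter of convention.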
- Contributors to Wikimedia projects. Stationary process. In: Wikipedia [Internet]. 18 Sep 2022 [cited 13 Nov 2022]. Available: https://en.wikipedia.org/wiki/Stationary_process ↩↩↩
- 6.4.4.2. Stationarity. In: Engineering Statistics Handbook [Internet]. NIST; [cited 13 Nov 2022]. Available: https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc442.htm#:~:text=Stationarity%20can%20be%20defined%20in,no%20periodic%20fluctuations%20(seasonality). ↩
- Shalizi C. 36-402, Undergraduate Advanced Data Analysis (2012). In: Undergraduate Advanced Data Analysis [Internet]. 2012 [cited 13 Nov 2022]. Available: https://www.stat.cmu.edu/~cshalizi/uADA/12/ ↩↩
- Schölkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, et al. Toward Causal Representation Learning. Proc IEEE. 2021;109: 612–634. doi:10.1109/JPROC.2021.3058954 ↩
- Contributors to Wikimedia projects. Autocorrelation. In: Wikipedia [Internet]. 10 Nov 2022 [cited 13 Nov 2022]. Available: https://en.wikipedia.org/wiki/Autocorrelation ↩
- Das P. Econometrics in Theory and Practice. Springer Nature Singapore; doi:10.1007/978-981-32-9019-8 ↩↩
- Dawid P, Tewari A. On learnability under general stochastic processes. Harvard Data Science Review. 2022; ↩
- Hyndman R. Rob J Hyndman - Terminology matters. In: Rob J Hyndman [Internet]. 26 Jun 2020 [cited 9 Nov 2023]. Available: https://robjhyndman.com/hyndsight/terminology-matters/#same-concept-different-terminology ↩
Box-Cox Transformation¶
Many time series models require stationary data. However, real-world time series data may be non-stationary and heteroscedastic1. The Box-Cox transformation is useful for reducing non-stationarity and heteroscedasticity.

Rob J Hyndman and George Athanasopoulos's famous textbook FPP2 provides some nice examples of Box-Cox transformations.
To see Box-Cox transformation in action, we show an example using the air passenger dataset.
The air passenger dataset is a monthly dataset. We can observe the trend and the varying variance simply by eye.

Applying Box-Cox transformations with different values of \(\lambda\) leads to the different results shown below.

To check the variance, we plot out the variance rolling on a 12-month window.

Box-Cox transformation with \(\lambda =0.1\) reduces the variability in variance.

Box-Cox May not Always Reach Perfect Stationary Data
The Box-Cox transformation is a simple transformation that helps us reduce non-stationarity and heteroscedasticity. However, we may not always be able to convert the dataset into stationary and homoscedastic data. This can be checked using tools such as stationarity_tests in Darts.
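For concreteness, the transformation itself is simple to write down; the following sketch implements the textbook formula \((x^\lambda - 1)/\lambda\) (with \(\log x\) at \(\lambda = 0\)) by hand rather than calling a library, and the example series is made up to mimic multiplicative growth:

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value ``x``:
    (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# A multiplicative series: the spread grows with the level ...
series = [10, 12, 100, 120, 1000, 1200]
transformed = [box_cox(v, 0.0) for v in series]
# ... while after the lam = 0 (log) transform, each 20% step
# becomes the same additive gap log(1.2).
```

In practice one estimates \(\lambda\) from the data (e.g., by maximum likelihood, as statistical packages do) rather than picking it by hand.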
- Homoscedasticity and heteroscedasticity. (2023, June 2). In Wikipedia. https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity ↩
Two-Way Fixed Effects¶
Two-way fixed effects on panel data is a handy method for building linear models from time series data. To keep our notation consistent, we will use the term multivariate time series to refer to panel data in the following content.
Two-way Fixed Effects Model¶
A two-way fixed effects model is a linear model that allows the parameters to vary across both time and the variables1,

$$y_{it} = \beta x_{it} + \alpha_i + \gamma_t + \epsilon_{it},$$
where \(\alpha_i\) and \(\gamma_t\) represent the effect coming from the variable and time, respectively.
Example¶
To help readers outside of econometrics or causal inference get started with this model, we will use a simple example to illustrate the idea. We will construct a naive dataset with three groups and two variables linearly related to each other.
We construct a naive dataset that contains three articles (column name),
each having a different distribution of prices and demand,
while all of them are generated with the same linear relation
between the variable log_demand and log_price.
The data points also fluctuate in time (column step).
Using a simple linear model with both time (step) and variable (name) fixed effects, we obtain the following results.
Estimation: OLS
Dep. var.: log_demand, Fixed effects: name+step
Inference: CRV1
Observations: 1450
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| log_price | -2.972 | 0.004 | -680.195 | 0.000 | -2.991 | -2.953 |
---
RMSE: 0.003 Adj. R2: 1.0 Adj. R2 Within: 1.0
pyfixest==0.10.10.0
seaborn==0.13.0
eerily==0.2.1
import numpy as np
import pandas as pd
import random
from pyfixest.estimation import feols
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from eerily.generators.elasticity import ElasticityStepper, LinearElasticityParams
from eerily.generators.naive import (
ConstantStepper,
ConstStepperParams,
SequenceStepper,
SequenceStepperParams,
)
from eerily.generators.utils.choices import Choices
# %% [markdown]
# ## Generate Data
# %%
def create_one_article(
    elasticity_value, length, article_id, initial_condition,
    log_prices, first_step=0
):
    es = ElasticityStepper(
        model_params=LinearElasticityParams(
            initial_state=initial_condition,
            log_prices=iter(log_prices),
            elasticity=iter([elasticity_value + (random.random() - 0.5) / 10] * length),
            variable_names=["log_demand", "log_price", "elasticity"],
        ),
        length=length,
    )
    ss = SequenceStepper(
        model_params=SequenceStepperParams(
            initial_state=[first_step], variable_names=["step"], step_sizes=[1]
        ),
        length=length,
    )
    cs = ConstantStepper(
        model_params=ConstStepperParams(initial_state=[article_id], variable_names=["name"]),
        length=length,
    )
    return es & ss & cs
initial_condition = {"log_demand": 3, "log_price": 1, "elasticity": None}
length_1 = 200
length_2 = 400
length_3 = 850
log_price_choices_1 = Choices(elements=[1,1.1, 1.2, 1.3, 1.4, 1.5])
log_price_choices_2 = Choices(elements=[1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
log_price_choices_3 = Choices(elements=[2, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8])
log_prices_1 = [next(log_price_choices_1) for i in range(length_1)]
log_prices_2 = [next(log_price_choices_2) for i in range(length_2)]
log_prices_3 = [next(log_price_choices_3) for i in range(length_3)]
data_gen = (
create_one_article(elasticity_value=-3, length=length_1, article_id="article_1", initial_condition=initial_condition, log_prices=log_prices_1)
+ create_one_article(elasticity_value=-3, length=length_2, article_id="article_2", initial_condition=initial_condition, log_prices=log_prices_2)
+ create_one_article(elasticity_value=-3, length=length_3, article_id="article_3", initial_condition=initial_condition, log_prices=log_prices_3)
)
# %%
df = pd.DataFrame(list(data_gen))
# %% [markdown]
# ## Visualizations
# %%
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.scatterplot(
df,
x="log_price",
y="log_demand",
hue="step",
style="name"
)
# %% [markdown]
# ## Estimation
# %%
fit_feols = feols(
fml="log_demand ~ log_price | name + step",
data=df
)
# %%
fit_feols.summary()
Tools and Further Reading
In the R world, fixest is a popular package for estimating two-way fixed effects models. In the Python world, we have something similar called pyfixest.
- Imai K, Kim IS. On the use of two-way fixed effects regression models for causal inference with panel data. Political Analysis 2021; 29: 405–415. ↩
The Time Delay Embedding Representation¶
The time delay embedding representation of time series data is widely used in deep learning forecasting models1. This is also called rolling in many time series analyses 2.
For simplicity, we only write down the representation for a problem with time series \(y_{1}, \cdots, y_{t}\), forecasting \(y_{t+1}\). We rewrite the series into a matrix, in an autoregressive way,

$$\begin{pmatrix} y_1 & y_2 & \cdots & y_p \\ y_2 & y_3 & \cdots & y_{p+1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{t-p+1} & y_{t-p+2} & \cdots & y_t \end{pmatrix} \to \begin{pmatrix} \color{red}{y_{p+1}} \\ \color{red}{y_{p+2}} \\ \vdots \\ \color{red}{y_{t+1}} \end{pmatrix},$$

which indicates that we use everything on the left, a matrix of shape \((t-p+1,p)\), to predict the vector on the right (in red). This is a useful representation when building deep learning models, as many neural networks require fixed-length inputs.
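A minimal sketch of building such a matrix (the helper `delay_embedding` is our own, and it keeps only the rows whose target has already been observed):

```python
def delay_embedding(y, p):
    """Build the delay-embedding matrix of window size ``p``:
    each row (y_i, ..., y_{i+p-1}) is paired with the target y_{i+p}."""
    rows, targets = [], []
    for i in range(len(y) - p):
        rows.append(y[i : i + p])
        targets.append(y[i + p])
    return rows, targets


rows, targets = delay_embedding([1, 2, 3, 4, 5], p=2)
# rows: [[1, 2], [2, 3], [3, 4]], targets: [3, 4, 5]
```

At inference time, the final window \((y_{t-p+1}, \cdots, y_t)\) becomes the fixed-length input from which the model predicts the unseen \(y_{t+1}\).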
Takens' Theorem¶
The reason the time delay embedding representation is useful is that it is a representation of the original time series that preserves the dynamics of the original time series, if any. The math behind it is Takens' theorem 3.
To illustrate the idea, we take our pendulum dataset as an example. The pendulum dataset describes a damped pendulum, for the math and visualizations please refer to the corresponding page. Here we apply the time delay embedding representation to the pendulum dataset by setting both the history length and the target length to 1, so that we can better visualize it.
We plot out the delayed embedding representation of the pendulum dataset. The x-axis is the value of the pendulum angle at time \(t\), and the y-axis is the value of the pendulum angle at time \(t+1\). The animation shows how the delayed embedding representation evolves over time and exhibits attractor behavior. If a model can capture these dynamics, it can make good predictions.

The notebook for more about the dataset itself is here.
from functools import cached_property
from typing import List, Tuple
import matplotlib as mpl
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import pandas as pd
from ts_dl_utils.datasets.dataset import DataFrameDataset
from ts_dl_utils.datasets.pendulum import Pendulum
# df is the pendulum angle DataFrame built in the pendulum dataset notebook
ds_de = DataFrameDataset(dataframe=df["theta"][:200], history_length=1, horizon=1)
class DelayedEmbeddingAnimation:
    """Builds an animation for univariate time series
    using delayed embedding.

    ```python
    fig, ax = plt.subplots(figsize=(10, 10))
    dea = DelayedEmbeddingAnimation(dataset=ds_de, fig=fig, ax=ax)
    ani = dea.build(interval=10, save_count=dea.time_steps)
    ani.save("results/pendulum_dataset/delayed_embedding_animation.mp4")
    ```

    :param dataset: a PyTorch dataset, input and target should have only length 1
    :param fig: figure object from matplotlib
    :param ax: axis object from matplotlib
    """

    def __init__(
        self, dataset: DataFrameDataset, fig: mpl.figure.Figure, ax: mpl.axes.Axes
    ):
        self.dataset = dataset
        self.ax = ax
        self.fig = fig

    @cached_property
    def data(self) -> List[Tuple[float, float]]:
        return [(i[0][0], i[1][0]) for i in self.dataset]

    @cached_property
    def x(self):
        return [i[0] for i in self.data]

    @cached_property
    def y(self):
        return [i[1] for i in self.data]

    def data_gen(self):
        for i in self.data:
            yield i

    def animation_init(self) -> mpl.axes.Axes:
        self.ax.plot(self.x, self.y)
        self.ax.set_xlim([-1.1, 1.1])
        self.ax.set_ylim([-1.1, 1.1])
        self.ax.set_xlabel("t")
        self.ax.set_ylabel("t+1")
        return self.ax

    def animation_run(self, data: Tuple[float, float]) -> mpl.axes.Axes:
        x, y = data
        self.ax.scatter(x, y)
        return self.ax

    @cached_property
    def time_steps(self):
        return len(self.data)

    def build(self, interval: int = 10, save_count: int = 10):
        return animation.FuncAnimation(
            self.fig,
            self.animation_run,
            self.data_gen,
            interval=interval,
            init_func=self.animation_init,
            save_count=save_count,
        )
fig, ax = plt.subplots(figsize=(10, 10))
dea = DelayedEmbeddingAnimation(dataset=ds_de, fig=fig, ax=ax)
ani = dea.build(interval=10, save_count=dea.time_steps)
gif_writer = animation.PillowWriter(fps=5, metadata=dict(artist="Lei Ma"), bitrate=100)
ani.save("results/pendulum_dataset/delayed_embedding_animation.gif", writer=gif_writer)
# ani.save("results/pendulum_dataset/delayed_embedding_animation.mp4")
In some advanced deep learning models, delayed embedding plays a crucial role. For example, Large Language Models (LLMs) can produce good forecasts by taking in delayed embeddings of time series4.
-
Hewamalage H, Ackermann K, Bergmeir C. Forecast evaluation for data scientists: Common pitfalls and best practices. 2022. http://arxiv.org/abs/2203.10716. ↩
-
Zivot E, Wang J. Modeling financial time series with S-PLUS. Springer New York, 2006. doi:10.1007/978-0-387-32348-0. ↩
-
Takens F. Detecting strange attractors in turbulence. In: Lecture notes in mathematics. Springer Berlin Heidelberg: Berlin, Heidelberg, 1981, pp 366–381. ↩
-
Rasul K, Ashok A, Williams AR, Khorasani A, Adamopoulos G, Bhagwatkar R et al. Lag-llama: Towards foundation models for time series forecasting. arXiv [csLG] 2023. http://arxiv.org/abs/2310.08278. ↩
Ended: Time Series Data
Data Generating Process ↵
Generating Processes for Time Series¶
The data generating processes (DGP) for time series are diverse. For example, in physics, we have all sorts of dynamical systems that generate time series data, and many dynamical models are formulated based on time series data. In industry, time series data often comes from stochastic processes.
We present some data generating processes to help us build up intuition when modeling real-world data.
Simple Examples of DGP¶
Exponential Growth
Exponential growth is a frequently observed natural and economical phenomenon.

Circular Motion
The circular motion shows some cyclic patterns.


Random Gaussian
Time series can also be noisy Gaussian samples.

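The three simple DGPs above can be sketched in a few lines of NumPy; the growth rate, period, and noise scale below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
t = np.linspace(0, 10, 200)

exponential_growth = np.exp(0.3 * t)              # exponential growth
circular_motion = np.sin(2 * np.pi * t / 5)       # cyclic pattern of a circular motion
random_gaussian = rng.normal(0, 1, size=len(t))   # noisy Gaussian samples
```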
General Linear Processes¶
A popular model for modeling as well as generating time series is the autoregressive (AR) model, which expresses the current value as a linear combination of past values plus noise.
AR(p) and the Lag Operator
A general autoregressive model of p-th order, AR(p), is

$$x_t = \phi_0 + \sum_{l=1}^{p} \phi_l x_{t-l} + \epsilon_t,$$

where \(l\) is the lag.
Define a lag operator \(\hat L\) with \(\hat L x_t = x_{t-1}\). The definition can also be rewritten using the lag operator,

$$x_t = \phi_0 + \sum_{l=1}^{p} \phi_l \hat L^l x_t + \epsilon_t.$$
We write down each time step in the following table.
| \(t\) | \(x_t\) |
|---|---|
| 0 | \(y_0\) |
| 1 | \(\phi_0 + \phi_1 y_0 + \epsilon_1\) |
| 2 | \(\phi_0 + \phi_1 (\phi_0 + \phi_1 y_0 + \epsilon_1) + \epsilon_2 = \phi_0 (1 + \phi_1) + \phi_1^2 y_0 + \phi_1\epsilon_1 + \epsilon_2\) |
| 3 | \(\phi_0 + \phi_1 (\phi_0 + \phi_1\phi_0 + \phi_1^2 y_0 + \phi_1\epsilon_1 + \epsilon_2) + \epsilon_3 = \phi_0(1 + \phi_1 + \phi_1^2) + \phi_1^3 y_0 + \phi_1^2\epsilon_1 + \phi_1\epsilon_2 + \epsilon_3\) |
| ... | ... |
| \(t\) | \(\phi_0 \sum_{i=0}^{t-1} \phi_1^i + \phi_1^t y_0 + \sum_{i=1}^{t} \phi_1^{t-i} \epsilon_{i}\) |
We have found a closed form for AR(1), i.e.,

$$x_t = \phi_0 \sum_{i=0}^{t-1} \phi_1^i + \phi_1^t y_0 + \sum_{i=1}^{t} \phi_1^{t-i} \epsilon_{i},$$

which is very similar to a general linear process1
The general linear process is the Taylor expansion of an arbitrary DGP \(x_t = \operatorname{DGP}(\epsilon_t, ...)\)1.
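As a sanity check, the closed form can be compared with the iterated AR(1) recursion numerically; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
phi0, phi1, y0, T = 0.5, 0.8, 1.0, 20
eps = rng.normal(0, 0.1, size=T + 1)  # eps[1..T] are used

# Iterate the recursion x_t = phi0 + phi1 * x_{t-1} + eps_t.
x = y0
for t in range(1, T + 1):
    x = phi0 + phi1 * x + eps[t]

# Closed form: phi0 * sum_{i=0}^{T-1} phi1^i + phi1^T y0 + sum_{i=1}^{T} phi1^{T-i} eps_i.
closed = (
    phi0 * sum(phi1**i for i in range(T))
    + phi1**T * y0
    + sum(phi1 ** (T - i) * eps[i] for i in range(1, T + 1))
)
print(np.isclose(x, closed))  # True
```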
Interactions between Series¶
The interactions between the series can be modeled as explicit interactions, e.g., many interacting spiking neurons, or through hidden variables, e.g., hidden state models2. Among these models, the Vector Autoregressive model, aka VAR, is simple but popular.
-
Das P. Econometrics in Theory and Practice. Springer Nature Singapore. doi:10.1007/978-981-32-9019-8. ↩↩
-
Contributors to Wikimedia projects. Hidden Markov model. In: Wikipedia [Internet]. 22 Oct 2022 [cited 22 Nov 2022]. Available: https://en.wikipedia.org/wiki/Hidden_Markov_model ↩
Time Series Data Generating Process: Langevin Equation¶
Among the many data generating processes (DGP), the Langevin equation is one of the most interesting.
Brownian Motion¶
Brownian motion, as a very simple stochastic process, can be described by the Langevin equation1. In this section, we simulate a time series dataset from Brownian motion.
Macroscopically, Brownian motion can be described by the notion of random forces on the particles,

$$\frac{dv(t)}{dt} = -\gamma v(t) + R(t),$$

where \(v(t)\) is the velocity at time \(t\), \(\gamma\) is the damping factor, and \(R(t)\) is the stochastic force density from the reservoir particles. Solving the equation, we get

$$v(t) = v(0) e^{-\gamma t} + \int_0^t e^{-\gamma (t - t')} R(t')\, dt'.$$
To generate a dataset numerically, we discretize it by replacing the integral with a sum,

$$v(t) = v(0) e^{-\gamma t} + \sum_{i=0}^{n-1} e^{-\gamma (t - t_i)} R(t_i) \Delta t,$$

where \(t_i = i \Delta t\) and \(t = t_n\), thus the equation is further simplified,

$$v(t_n) = v(0) e^{-\gamma n \Delta t} + \sum_{i=0}^{n-1} e^{-\gamma (n - i) \Delta t} R(t_i) \Delta t.$$
The first term in the solution is responsible for the exponential decay and the second term calculates the effect of the stochastic force.
To simulate a Brownian motion, we can use either the formal solution or the differential equation itself. Here we choose the differential equation. To simulate the process numerically, we rewrite

$$\frac{dv(t)}{dt} = -\gamma v(t) + R(t)$$

as

$$v(t + \Delta t) = v(t) - \gamma v(t) \Delta t + R(t) \Delta t.$$
The following is a simulated 1D Brownian motion.

We create a stepper to calculate the next steps.
import copy
from typing import Dict, Iterator, Optional
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
## Define Brownian Motion
class GaussianForce:
"""A Gaussian stochastic force iterator.
Each iteration returns a single sample from the corresponding
Gaussian distribution.
:param mu: mean of the Gaussian distribution
:param std: standard deviation of the Gaussian distribution
:param seed: seed for the random generator
"""
def __init__(self, mu: float, std: float, seed: Optional[float] = None):
self.mu = mu
self.std = std
self.rng = np.random.default_rng(seed=seed)
def __next__(self) -> float:
return self.rng.normal(self.mu, self.std)
class BrownianMotionStepper:
r"""Calculates the next step in a Brownian motion.
:param gamma: the damping factor $\gamma$ of the Brownian motion.
:param delta_t: the minimum time step $\Delta t$.
:param force_densities: the stochastic force densities, e.g. [`GaussianForce`][eerily.data.generators.brownian.GaussianForce].
:param initial_state: the initial velocity $v(0)$.
"""
def __init__(
self,
gamma: float,
delta_t: float,
force_densities: Iterator,
initial_state: Dict[str, float],
):
self.gamma = gamma
self.delta_t = delta_t
self.force_densities = copy.deepcopy(force_densities)
self.current_state = copy.deepcopy(initial_state)
def __iter__(self):
return self
def __next__(self) -> Dict[str, float]:
force_density = next(self.force_densities)
v_current = self.current_state["v"]
v_next = v_current + force_density * self.delta_t - self.gamma * v_current * self.delta_t
self.current_state["force_density"] = force_density
self.current_state["v"] = v_next
return copy.deepcopy(self.current_state)
## Generating time series
delta_t = 0.1
stepper = BrownianMotionStepper(
gamma=0,
delta_t=delta_t,
force_densities=GaussianForce(mu=0, std=1),
initial_state={"v": 0},
)
length = 200
history = []
for _ in range(length):
history.append(next(stepper))
df = pd.DataFrame(history)
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.lineplot(
x=np.linspace(0, length-1, length) * delta_t,
y=df.v,
ax=ax,
marker="o",
)
ax.set_title("Brownian Motion")
ax.set_xlabel("Time")
ax.set_ylabel("Velocity")
-
Ma L. Brownian Motion — Statistical Physics Notes. In: Statistical Physics [Internet]. [cited 17 Nov 2022]. Available: https://statisticalphysics.leima.is/nonequilibrium/brownian-motion.html ↩
Ended: Data Generating Process
Kindergarten Models ↵
Statistical Models of Time Series¶
Though statistical models are not our focus, it is always beneficial to understand how those famous statistical models work. To understand them best, we will build some data generating processes using these models and explore their behavior.
In the following paragraphs, we list some of the most applied statistical models. For a comprehensive review of statistical models, please refer to Petropoulos et al., 2022 and Hyndman et al., 202134.
ARIMA¶
ARIMA is one of the most famous forecasting models1. We will not discuss the details of the model. However, for reference, we sketch the relations between the different components of the ARIMA model in the following chart.
flowchart TD
AR --"interdependencies"--> VAR
MA --"add autoregressive"--> ARMA
AR --"add moving average"--> ARMA
ARMA --"difference between values"--> ARIMA
ARMA --"interdependencies"--> VARMA
VAR --"moving average"--> VARMA
ARIMA --"interdependencies"--> VARIMA
VAR --"difference and moving average"--> VARIMA
VARMA --"difference"--> VARIMA
Exponential Smoothing¶
A Naive Forecast
In time series forecasting, one of the most naive forecasts is the previous observation, i.e.,

$$\hat s_{t+1} = s_t,$$

where we use \(\hat s\) to denote the forecasts and \(s\) for the observations.
A naive version of the exponential smoothing model is Simple Exponential Smoothing (SES)34. The SES forecast is a weighted average of the most recent observation and the previous forecast,

$$\hat s_{t+1} = \alpha s_t + (1 - \alpha) \hat s_t,$$

where \(\hat s\) is the forecast, \(s\) is the observation, and \(0 \le \alpha \le 1\) is the smoothing factor. Expanding this form, we observe the exponentially decaying effect of observations in the long past4.
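A minimal sketch of SES as a one-step-ahead forecaster; the helper `ses` and the smoothing factor value are our own illustrative choices:

```python
def ses(observations, alpha=0.3):
    """One-step-ahead SES forecasts: s_hat[t+1] = alpha * s[t] + (1 - alpha) * s_hat[t]."""
    forecasts = [observations[0]]  # bootstrap the first forecast with the first observation
    for s_t in observations[:-1]:
        forecasts.append(alpha * s_t + (1 - alpha) * forecasts[-1])
    return forecasts

# Expanding the recursion shows each past observation s[t-k] carries
# a weight alpha * (1 - alpha)**k, decaying exponentially with age.
print(ses([1.0, 0.0, 0.0, 0.0], alpha=0.5))  # [1.0, 1.0, 0.5, 0.25]
```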
State Space Models¶
State space models (SSM) are appealing due to their simplicity. SSMs build on Markov chains but are not limited to Markovian assumptions5.
-
Cerqueira V, Torgo L, Soares C. Machine Learning vs Statistical Methods for Time Series Forecasting: Size Matters. arXiv [stat.ML]. 2019. Available: http://arxiv.org/abs/1909.13316 ↩
-
Wu Z, Pan S, Long G, Jiang J, Chang X, Zhang C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2005.11650 ↩
-
Petropoulos F, Apiletti D, Assimakopoulos V, Babai MZ, Barrow DK, Ben Taieb S, et al. Forecasting: theory and practice. Int J Forecast. 2022;38: 705–871. doi:10.1016/j.ijforecast.2021.11.001 ↩↩
-
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2022-11-27. ↩↩↩
-
Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. Available: https://play.google.com/store/books/details?id=qWPwnQEACAAJ ↩
AR¶
Autoregressive (AR) models are simple models for time series. A general AR(p) model is described by the following process:

$$s_t = \phi_0 + \sum_{l=1}^{p} \phi_l s_{t-l} + \epsilon_t.$$
AR(1)¶
A first order AR model, aka AR(1), is as simple as

$$s(t+1) = \phi_0 + \phi_1 s(t) + \epsilon.$$
By staring at this equation, we can build up our intuitions.
| \(\phi_0\) | \(\phi_1\) | \(\epsilon\) | Behavior |
|---|---|---|---|
| - | \(0\) | - | constant + noise |
| \(0\) | \(1\) | - | random walk (constant without noise) |
| \(0\) | \(\phi_1>1\) or \(0\le\phi_1 \lt 1\) | - | exponential growth or decay + noise |
Exponential Behavior doesn't Always Approach Positive Infinity
For example, the combination \(\phi_0=0\) and \(\phi_1>1\) without noise leads to exponential growth if the initial value of the series is positive. However, the series approaches negative infinity if the initial value is negative.




import copy
from dataclasses import dataclass
from typing import Dict, Iterator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
class GaussianEpsilon:
"""Gaussian noise
:param mu: mean value of the noise
:param std: standard deviation of the noise
"""
def __init__(self, mu, std, seed=None):
self.mu = mu
self.std = std
self.rng = np.random.default_rng(seed=seed)
def __next__(self):
return self.rng.normal(self.mu, self.std)
class ZeroEpsilon:
"""Constant noise
:param epsilon: the constant value to be returned
"""
def __init__(self, epsilon=0):
self.epsilon = epsilon
def __next__(self):
return self.epsilon
@dataclass(frozen=True)
class ARModelParams:
r"""Parameters of our AR model,
$$s(t+1) = \phi_0 + \phi_1 s(t) + \epsilon.$$
:param delta_t: step size of time in each iteration
:param phi0: $\phi_0$ in the AR model
:param phi1: $\phi_1$ in the AR model
:param epsilon: noise iterator, e.g., Gaussian noise
:param initial_state: a dictionary of the initial state, e.g., `{"s": 1}`
"""
delta_t: float
phi0: float
phi1: float
epsilon: Iterator
initial_state: Dict[str, float]
class AR1Stepper:
"""Stepper that calculates the next step in time in an AR model
:param model_params: parameters for the AR model
"""
def __init__(self, model_params):
self.model_params = model_params
self.current_state = copy.deepcopy(self.model_params.initial_state)
def __iter__(self):
return self
def __next__(self):
phi0 = self.model_params.phi0
phi1 = self.model_params.phi1
epsilon = next(self.model_params.epsilon)
next_s = phi0 + phi1 * self.current_state["s"] + epsilon
self.current_state = {"s": next_s}
return copy.deepcopy(self.current_state)
def visualize_vr1(delta_t, phi0, phi1, length=200, savefig=False):
mu = 0
std = 0.1
geps = GaussianEpsilon(mu=mu, std=std)
zeps = ZeroEpsilon()
initial_state = {"s": -1}
ar1_params = ARModelParams(
delta_t=delta_t, phi0=phi0, phi1=phi1, epsilon=geps, initial_state=initial_state
)
ar1_params_zero_noise = ARModelParams(
delta_t=delta_t, phi0=phi0, phi1=phi1, epsilon=zeps, initial_state=initial_state
)
ar1_stepper = AR1Stepper(model_params=ar1_params)
ar1_stepper_no_noise = AR1Stepper(model_params=ar1_params_zero_noise)
history = []
history_zero_noise = []
for l in range(length):
history.append(next(ar1_stepper))
history_zero_noise.append(next(ar1_stepper_no_noise))
df = pd.DataFrame(history)
df_zero_noise = pd.DataFrame(history_zero_noise)
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.lineplot(
x=np.linspace(0, length - 1, length) * delta_t,
y=df.s,
ax=ax,
marker=".",
label="AR1",
color="r",
alpha=0.9,
)
sns.lineplot(
x=np.linspace(0, length - 1, length) * delta_t,
y=df_zero_noise.s,
ax=ax,
marker=".",
label="AR1 (without Noise)",
color="g",
alpha=0.5,
)
ax.set_title(
rf"AR(1) Example ($\phi_0={phi0}$, $\phi_1={phi1}$; $\epsilon$: $\mu={mu}$, $\sigma={std}$; $s(0)={initial_state['s']}$)"
)
ax.set_xlabel("Time")
ax.set_ylabel("Values")
if savefig:
plt.savefig(
f"/work/timeseries-dgp-ar-var/exports/ar1-phi0-{phi0}-phi1-{phi1}-std-{std}-init-{initial_state['s']}.png"
)
Call the function visualize_vr1 to make some plots.
visualize_vr1(delta_t = 0.01, phi0 = 0, phi1 = 1.1, length = 200, savefig=True)
-
Kumar A. Autoregressive (AR) models with Python examples. In: Data Analytics [Internet]. 25 Apr 2022 [cited 11 Aug 2022]. Available: https://vitalflux.com/autoregressive-ar-models-with-python-examples/ ↩
VAR¶
VAR(1)¶
VAR(1) is similar to AR(1) but models multiple time series with interactions between them. For example, a two-dimensional VAR(1) model is

$$\begin{align}
s_{1,t+1} &= \phi_{0,1} + \phi_{11} s_{1,t} + \phi_{12} s_{2,t} + \epsilon_{1,t+1}, \\
s_{2,t+1} &= \phi_{0,2} + \phi_{21} s_{1,t} + \phi_{22} s_{2,t} + \epsilon_{2,t+1}.
\end{align}$$

A more compact form is

$$\boldsymbol s_{t+1} = \boldsymbol \phi_0 + \boldsymbol \phi_1 \boldsymbol s_t + \boldsymbol \epsilon_{t+1},$$

where \(\boldsymbol \phi_1\) is the matrix with elements \(\phi_{ij}\).
Stability of VAR
For VAR(1), our series blows up when the largest absolute eigenvalue of the matrix \(\boldsymbol \phi_1\) is larger than 11. Otherwise, we get stable series.
In the following examples, we denote the largest eigenvalue of \(\boldsymbol \phi_1\) as \(\lambda_0\).
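As a quick numerical check of this criterion, we can compute the largest absolute eigenvalue of the \(\boldsymbol \phi_1\) matrices used in the stable and unstable examples below:

```python
import numpy as np

# phi_1 matrices from the stable (0.45 + 0.2) and unstable (0.45 + 0.5) examples.
phi1_stable = np.array([[0.5, -0.25], [-0.35, 0.65]])
phi1_unstable = np.array([[0.5, -0.25], [-0.35, 0.95]])

rho_stable = np.abs(np.linalg.eigvals(phi1_stable)).max()
rho_unstable = np.abs(np.linalg.eigvals(phi1_unstable)).max()

print(rho_stable < 1, rho_unstable > 1)  # True True
```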

The figure is created using the code from the "Python Code" tab, and the following parameters.
var_params_stable = VAR1ModelParams(
delta_t = 0.01,
phi0 = np.array([0.1, 0.1]),
phi1 = np.array([
[0.5, -0.25],
[-0.35, 0.45+0.2]
]),
epsilon = ConstantEpsilon(epsilon=np.array([0,0])),
initial_state = np.array([1, 0])
)
var1_visualize(var_params=var_params_stable)

The figure is created using the code from the "Python Code" tab, and the following parameters.
var_params_unstable = VAR1ModelParams(
delta_t = 0.01,
phi0 = np.array([0.1, 0.1]),
phi1 = np.array([
[0.5, -0.25],
[-0.35, 0.45+0.5]
]),
epsilon = ConstantEpsilon(epsilon=np.array([0,0])),
initial_state = np.array([1, 0])
)
var1_visualize(var_params=var_params_unstable)

The figure is created using the code from the "Python Code" tab, and the following parameters.
var_params_no_noise = VAR1ModelParams(
delta_t = 0.01,
phi0 = np.array([-1, 1]),
phi1 = np.array([
[0.7, 0.2],
[0.2, 0.7]
]),
epsilon = ConstantEpsilon(epsilon=np.array([0,0])),
initial_state = np.array([1, 0])
)
var1_visualize(var_params=var_params_no_noise)

The figure is created using the code from the "Python Code" tab, and the following parameters.
var_params_zero_mean_noise = VAR1ModelParams(
delta_t = 0.01,
phi0 = np.array([-1, 1]),
phi1 = np.array([
[0.7, 0.2],
[0.2, 0.7]
]),
epsilon = MultiGaussianNoise(mu=np.array([0, 0]), cov=np.array([[1, 0.5],[0.5, 1]])),
initial_state = np.array([1, 0])
)
var1_visualize(var_params=var_params_zero_mean_noise)

The figure is created using the code from the "Python Code" tab, and the following parameters.
var_params_nonzero_mean_noise = VAR1ModelParams(
delta_t = 0.01,
phi0 = np.array([-1, 1]),
phi1 = np.array([
[0.7, 0.2],
[0.2, 0.7]
]),
epsilon = MultiGaussianNoise(mu=np.array([1, 2]), cov=np.array([[1, 0.5],[0.5, 1]])),
initial_state = np.array([1, 0])
)
var1_visualize(var_params=var_params_nonzero_mean_noise)
import copy
from dataclasses import dataclass
from typing import Iterator, Optional
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
class MultiGaussianNoise:
"""A multivariate Gaussian noise
:param mu: means of the variables
:param cov: covariance of the variables
:param seed: seed of the random number generator for reproducibility
"""
def __init__(self, mu: np.ndarray, cov: np.ndarray, seed: Optional[float] = None):
self.mu = mu
self.cov = cov
self.rng = np.random.default_rng(seed=seed)
def __next__(self) -> np.ndarray:
return self.rng.multivariate_normal(self.mu, self.cov)
class ConstantEpsilon:
"""Constant noise
:param epsilon: the constant value to be returned
"""
def __init__(self, epsilon=0):
self.epsilon = epsilon
def __next__(self):
return self.epsilon
@dataclass(frozen=True)
class VAR1ModelParams:
"""Parameters of our VAR(1) model.
:param delta_t: step size of time in each iteration
:param phi0: phi_0 in the VAR model
:param phi1: phi_1 in the VAR model
:param epsilon: noise iterator, e.g., Gaussian noise
:param initial_state: an array of the initial state, e.g., `np.array([1, 0])`
"""
delta_t: float
phi0: np.ndarray
phi1: np.ndarray
epsilon: Iterator
initial_state: np.ndarray
class VAR1Stepper:
"""Calculate the next values using VAR(1) model.
:param model_params: the parameters of the VAR(1) model, e.g.,
[`VAR1ModelParams`][eerily.data.generators.var.VAR1ModelParams]
"""
def __init__(self, model_params):
self.model_params = model_params
self.current_state = copy.deepcopy(self.model_params.initial_state)
def __iter__(self):
return self
def __next__(self):
epsilon = next(self.model_params.epsilon)
phi0 = self.model_params.phi0
phi1 = self.model_params.phi1
self.current_state = phi0 + np.matmul(phi1, self.current_state) + epsilon
return copy.deepcopy(self.current_state)
class Factory:
"""A generator that creates data points from a stepper."""
def __call__(self, stepper, length):
for _ in range(length):
yield next(stepper)
We create a function to visualize the series.
def var1_visualize(var_params):
phi1_eig_max = max(np.linalg.eig(var_params.phi1)[0])
var1_stepper = VAR1Stepper(model_params=var_params)
length = 200
fact = Factory()
history = list(fact(var1_stepper, length=length))
df = pd.DataFrame(history, columns=["s1", "s2"])
print(df.head())
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.lineplot(
x=np.linspace(0, length-1, length) * var_params.delta_t,
y=df.s1,
ax=ax,
marker="o",
)
sns.lineplot(
x=np.linspace(0, length-1, length) * var_params.delta_t,
y=df.s2,
ax=ax,
marker="o",
)
ax.set_title(rf"VAR(1) Example ($\lambda_0={phi1_eig_max:0.2f}$)")
ax.set_xlabel("Time")
ax.set_ylabel("Values")
-
Zivot E, Wang J. Modeling Financial Time Series with S-PLUS®. Springer New York; 2006. doi:10.1007/978-0-387-32348-0 ↩
Ended: Kindergarten Models
Synthetic Datasets ↵
Synthetic Time Series¶
Synthetic time series data is useful in time series modeling, such as forecasting.
Real-world time series data often comes with complex dynamics in the data generating process. Benchmarking models on real-world data alone often fails to isolate the effects of specific design choices in forecasting models. Synthetic time series data provides a controlled playground for benchmarking models and can provide useful insights.
Another application of synthetic data is to improve model performance. Synthetic data can be used to augment the training data2 as well as in transfer learning1.
A third application of synthetic data is data sharing without compromising privacy and business secrets3.
Though useful, synthesizing proper artificial time series data can be very complicated, as there is an enormous number of diverse theories associated with time series data. On the other hand, many time series generators are quite universal. For example, GANs can be used to generate realistic time series4.
In this chapter, we will explain the basic ideas and demonstrate our generic programming framework for synthetic time series. With the basics explored, we will focus on a special case of synthetic time series: time series with interactions.
-
Rotem Y, Shimoni N, Rokach L, Shapira B. Transfer learning for time series classification using synthetic data generation. arXiv [cs.LG]. 2022. Available: http://arxiv.org/abs/2207.07897 ↩
-
Bandara K, Hewamalage H, Liu Y-H, Kang Y, Bergmeir C. Improving the Accuracy of Global Forecasting Models using Time Series Data Augmentation. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2008.02663 ↩
-
Lin Z, Jain A, Wang C, Fanti G, Sekar V. Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions. arXiv [cs.LG]. 2019. Available: http://arxiv.org/abs/1909.13403 ↩
-
Leznik M, Michalsky P, Willis P, Schanzel B, Östberg P-O, Domaschka J. Multivariate Time Series Synthesis Using Generative Adversarial Networks. Proceedings of the ACM/SPEC International Conference on Performance Engineering. New York, NY, USA: Association for Computing Machinery; 2021. pp. 43–50. doi:10.1145/3427921.3450257 ↩
Synthetic Time Series¶
With a proper understanding of the DGP, we can build data generators around the DGP we choose.
GluonTS¶
GluonTS is a Python package for probabilistic time series modeling. It comes with a simple yet easy-to-use synthetic data generator. For example, to generate a random Gaussian time series, we only need the following code1.
from gluonts.dataset.artificial import recipe as rcp
g_rg = rcp.RandomGaussian(stddev=2)
g_rg_series = rcp.evaluate(g_rg, 100)
For more complicated multivariate time series, we create recipes for our variables. We steal the example in the GluonTS tutorial.
from gluonts.dataset.artificial import recipe as rcp
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
daily_smooth_seasonality = rcp.SmoothSeasonality(period=288, phase=-72)
noise = rcp.RandomGaussian(stddev=0.1)
signal = daily_smooth_seasonality + noise
recipe = dict(
daily_smooth_seasonality=daily_smooth_seasonality, noise=noise, signal=signal
)
rec_eval = rcp.evaluate(recipe, 500)
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.lineplot(rec_eval)

-
Synthetic data generation. In: GluonTS documentation [Internet]. [cited 13 Nov 2022]. Available: https://ts.gluon.ai/stable/tutorials/data_manipulation/synthetic_data_generation.html ↩
Data Augmentation for Time Series¶
In deep learning, our dataset should help the optimization mechanism locate a good spot in the parameter space. However, real-world data is not necessarily diverse enough to cover the required situations with enough records. For example, some datasets may have extremely imbalanced class labels, which leads to poor performance in classification tasks 3. Another problem with a limited dataset is that the trained model may not generalize well 45.
We will cover two topics in this section: Augmenting the dataset and application of the augmented data to model training.
Augmenting the Dataset¶
There are many different ways of augmenting time series data 46. We categorize the methods into the following groups:
- Random transformations, e.g., jittering;
- Pattern mixing, e.g., DBA7;
- Generative models, e.g., GANs.
We also treat the first two methods, random transformations and pattern mixing, as basic methods.
Basic Methods¶
In the following table, we group some of the data augmentation methods by two dimensions, the category of the method, and the domain of where the method is applied.
| | Projected Domain | Time Scale | Magnitude |
|---|---|---|---|
| Random Transformation | Frequency Masking, Frequency Warping, Fourier Transform, STFT | Permutation, Slicing, Time Warping, Time Masking, Cropping | Jittering, Flipping, Scaling, Magnitude Warping |
| Pattern Mixing | EMDA12, SFM13 | Guided Warping14 | DFM9, Interpolation, DBA7 |
For completeness, we will explain some of the methods in more detail in the following.
Perturbation in Fourier Domain¶
In the Fourier domain, for the amplitude \(A_f\) and phase \(\phi_f\) at each frequency \(f\), we can perform15
- magnitude replacement using a Gaussian distribution, and
- phase shift by adding Gaussian noise.
We perform such perturbations at some chosen frequencies.
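A sketch of such a perturbation using NumPy's FFT; the number of perturbed frequencies and the noise scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
series = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * rng.normal(size=256)

spectrum = np.fft.rfft(series)
amplitude, phase = np.abs(spectrum), np.angle(spectrum)

# Perturb a few chosen frequencies.
idx = rng.choice(len(spectrum), size=5, replace=False)
amplitude[idx] = np.abs(rng.normal(amplitude[idx], 0.5))  # magnitude replacement
phase[idx] += rng.normal(0, 0.1, size=5)                  # phase shift

# Back to the time domain.
augmented = np.fft.irfft(amplitude * np.exp(1j * phase), n=len(series))
```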
Slicing, Permutation, and Bootstrapping¶
We can slice a series into small segments. With the slices, we can perform different operations to create new series.
- Window Slicing (WS): In a classification task, we can take the slices from the original series and assign the same class label to the slice 16. The slices can also be interpolated to match the length of the original series 4.
- Permutation: We take the slices and permute them to form a new series 17.
- Moving Block Bootstrapping (MBB): First, we remove the trend and seasonability. Then we draw blocks of fixed length from the residual of the series until the desired length of the series is met. Finally, we combine the newly formed residual with trend and seasonality to form a new series 18.
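The block-drawing step of MBB can be sketched as follows, assuming the trend and seasonality have already been removed; the helper name and block length are our own choices:

```python
import numpy as np

def moving_block_bootstrap(residual, block_size=20, seed=None):
    """Resample a residual series by drawing overlapping blocks of fixed length."""
    rng = np.random.default_rng(seed)
    n = len(residual)
    blocks = []
    while sum(len(b) for b in blocks) < n:
        start = rng.integers(0, n - block_size + 1)
        blocks.append(residual[start : start + block_size])
    return np.concatenate(blocks)[:n]

residual = np.random.default_rng(0).normal(size=100)
new_residual = moving_block_bootstrap(residual, block_size=20, seed=1)
# Recombine new_residual with the removed trend and seasonality to get a new series.
```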
Warping¶
Both the time scale and magnitude can be warped. For example,
- Time Warping: We distort time intervals by taking a range of data points and upsampling or downsampling it 6.
- Magnitude Warping: The magnitude of the time series is rescaled.
Dynamic Time Warping (DTW)
Given two sequences, \(S^{(1)}\) and \(S^{(2)}\), the Dynamic Time Warping (DTW) algorithm finds the best way to align two sequences. During this alignment process, we quantify the misalignment using a distance similar to the Levenshtein distance, where the distance between two series \(S^{(1)}_{1:i}\) (with \(i\) elements) and \(S^{(2)}_{1:j}\) (with \(j\) elements) is7

$$D(S^{(1)}_{1:i}, S^{(2)}_{1:j}) = d(S^{(1)}_i, S^{(2)}_j) + \min\left( D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j}), D(S^{(1)}_{1:i}, S^{(2)}_{1:j-1}), D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j-1}) \right),$$

where \(S^{(1)}_i\) is the \(i\)th element of the series \(S^{(1)}\), and \(d(x,y)\) is a predetermined distance, e.g., the Euclidean distance. This definition reveals the recursive nature of the DTW distance.
Notations in the Definition: \(S_{1:i}\) and \(S_{i}\)
The notation \(S_{1:i}\) stands for a series that contains the elements starting from the first to the \(i\)th in series \(S\). For example, we have a series
The notation \(S^1_{1:4}\) represents
The notation \(S_i\) indicates the \(i\)th element in \(S\). For example,
If we map these two notations to Python,
- \(S_{1:i}\) is equivalent to
S[0:i], and - \(S_i\) is equivalent to
S[i-1].
Note that the indices in Python look strange. This is also the reason we choose to use subscripts not square brackets in our definition.
Levenshtein Distance
Given two words, e.g., \(w^{a} = \mathrm{cats}\) and \(w^{b} = \mathrm{katz}\). Suppose we can only use three operations: insertions, deletions and substitutions. The Levenshtein distance calculates the number of such operations needed to change from the first word \(w^a\) to the second one \(w^b\) by applying single-character edits. In this example, we need two replacements, i.e., "c" -> "k" and "s" -> "z".
The Levenshtein distance can be solved using recursive algorithms 1.
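An iterative dynamic-programming sketch of the Levenshtein distance described above:

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("cats", "katz"))  # 2: "c" -> "k" and "s" -> "z"
```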
DTW is very useful when comparing series with different lengths. For example, most error metrics require the actual time series and predicted series to have the same length. In the case of different lengths, we can perform DTW when calculating these metrics2.
The forecasting package darts provides a demo of DTW.
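A minimal dynamic-programming sketch of the DTW distance, using the absolute difference as the pointwise distance \(d\); note that it handles series of different lengths:

```python
import numpy as np

def dtw_distance(s1, s2):
    """Dynamic-programming DTW distance with absolute difference as the point distance."""
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s1[i - 1] - s2[j - 1])
            # Recursion: cost of matching the current points plus the best
            # of shrinking either series or both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-warped copy of a series has DTW distance zero to the original:
print(dtw_distance([0, 1, 2, 1, 0], [0, 1, 1, 2, 2, 1, 0]))  # 0.0
```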
DTW Barycenter Averaging
DTW Barycenter Averaging (DBA) constructs a series \(\bar{\mathcal S}\) out of a set of series \(\{\mathcal S^{(\alpha)}\}\) so that \(\bar{\mathcal S}\) is the barycenter of \(\{\mathcal S^{(\alpha)}\}\) measured by Dynamic Time Warping (DTW) distance 7.
Barycenter Averaging Based on DTW Distance¶
Petitjean et al. proposed a time series averaging algorithm based on the DTW distance, dubbed DTW Barycenter Averaging (DBA).
DBA Implementation
Series Mixing¶
Another class of data augmentation methods is mixing the series. For example, we take two randomly drawn series and average them using DTW Barycenter Averaging (DBA) 7. (DTW, dynamic time warping, is an algorithm to calculate the distance between sequential datasets by matching the data points on each of the series 719.) To augment a dataset, we can choose from a list of strategies 2021:
- Average All series using different sets of weights to create new synthetic series.
- Average Selected series based on some strategies. For example, Forestier et al. proposed choosing an initial series and combining it with its nearest neighbors 21.
- Average Selected with Distance is Average Selected but neighbors that are far from the initial series are down-weighted 21.
Some other similar methods are
- Equalized Mixture Data Augmentation (EMDA) calculates the weighted average of spectrograms of the same class label12.
- Stochastic Feature Mapping (SFM) is a data augmentation method in audio data13.
Data Generating Process¶
Time series data can also be augmented using some assumed data generating process (DGP). Some methods, such as GRATIS 8, utilize simple generic models such as AR/MAR. Others, such as Gaussian Trees 22, utilize more complicated hidden structures using graphs, which can approximate more complicated data generating processes. These methods do not necessarily reflect the actual data generating process; rather, the data is generated using parsimonious phenomenological models. Some other methods are tuned more toward detailed mechanisms. There are also methods using generative deep neural networks such as GANs.
Dynamic Factor Model (DFM)¶
For example, we have a series \(X(t)\) which depends on a latent variable \(f(t)\)9,
where \(f(t)\) is determined by a differential equation
In the above equations, \(\eta(t)\) and \(\xi(t)\) are the irreducible noise.
The above two equations can be combined into one first-order differential equation.
Once the model is fit, it can be used to generate new data points. However, we have to verify whether the real data is plausibly generated by such a process.
Applying the Synthetic Data to Model Training¶
Once we prepared the synthetic dataset, there are two strategies to include them in our model training 20.
| Strategy | Description |
|---|---|
| Pooled Strategy | Synthetic data + original data -> model |
| Transfer Strategy | Synthetic data -> pre-trained model; pre-trained model + original data -> model |
The pooled strategy takes the synthetic data and original data then feeds them together into the training pipeline. The transfer strategy uses the synthetic data to pre-train the model, then uses transfer learning methods (e.g., freeze weights of some layers) to train the model on the original data.
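A sketch of the transfer strategy in PyTorch; the tiny model architecture and the choice of which layers to freeze are illustrative assumptions:

```python
import torch
from torch import nn

# Pretend this model was pre-trained on the synthetic data.
model = nn.Sequential(
    nn.Linear(10, 32),  # "encoder" layers, pre-trained on synthetic series
    nn.ReLU(),
    nn.Linear(32, 1),   # forecasting head
)

# Transfer strategy: freeze the pre-trained encoder, fine-tune only the head
# on the original data.
for param in model[0].parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# ... then run the usual training loop on the original data.
```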
-
trekhleb. javascript-algorithms/src/algorithms/string/levenshtein-distance at master · trekhleb/javascript-algorithms. In: GitHub [Internet]. [cited 27 Jul 2022]. Available: https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/string/levenshtein-distance ↩
-
Unit8. Metrics — darts documentation. In: Darts [Internet]. [cited 7 Mar 2023]. Available: https://unit8co.github.io/darts/generated_api/darts.metrics.metrics.html?highlight=dtw#darts.metrics.metrics.dtw_metric ↩
-
Hasibi R, Shokri M, Dehghan M. Augmentation scheme for dealing with imbalanced network traffic classification using deep learning. 2019. http://arxiv.org/abs/1901.00204. ↩
-
Iwana BK, Uchida S. An empirical survey of data augmentation for time series classification with neural networks. 2020. http://arxiv.org/abs/2007.15951. ↩↩↩
-
Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data 2019; 6: 1–48. ↩
-
Wen Q, Sun L, Yang F, Song X, Gao J, Wang X et al. Time series data augmentation for deep learning: A survey. 2020. http://arxiv.org/abs/2002.12478. ↩↩
-
Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition 2011; 44: 678–693. ↩↩↩↩↩↩
-
Kang Y, Hyndman RJ, Li F. GRATIS: GeneRAting TIme series with diverse and controllable characteristics. 2019. http://arxiv.org/abs/1903.02787. ↩↩
-
Stock JH, Watson MW. Chapter 8 - dynamic factor models, Factor-Augmented vector autoregressions, and structural vector autoregressions in macroeconomics. In: Taylor JB, Uhlig H (eds). Handbook of macroeconomics. Elsevier, 2016, pp 415–525. ↩↩↩
-
Yoon J, Jarrett D, Schaar M van der. Time-series generative adversarial networks. In: Wallach H, Larochell H, Beygelzime A, Buc F dAlche, Fox E, Garnett R (eds). Advances in neural information processing systems. Curran Associates, Inc., 2019. https://papers.nips.cc/paper/2019/hash/c9efe5f26cd17ba6216bbe2a7d26d490-Abstract.html. ↩
-
Brophy E, Wang Z, She Q, Ward T. Generative adversarial networks in time series: A survey and taxonomy. 2021. http://arxiv.org/abs/2107.11098. ↩
-
Takahashi N, Gygli M, Van Gool L. AENet: Learning deep audio features for video analysis. 2017. http://arxiv.org/abs/1701.00599. ↩↩
-
Cui X, Goel V, Kingsbury B. Data augmentation for deep neural network acoustic modeling. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). 2014, pp 5582–5586. ↩↩
-
Iwana BK, Uchida S. Time series data augmentation for neural networks by time warping with a discriminative teacher. 2020. http://arxiv.org/abs/2004.08780. ↩
-
Gao J, Song X, Wen Q, Wang P, Sun L, Xu H. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. 2020. http://arxiv.org/abs/2002.09545. ↩
-
Le Guennec A, Malinowski S, Tavenard R. Data augmentation for time series classification using convolutional neural networks. In: ECML/PKDD workshop on advanced analytics and learning on temporal data. 2016. https://halshs.archives-ouvertes.fr/halshs-01357973/document. ↩
-
Um TT, Pfister FMJ, Pichler D, Endo S, Lang M, Hirche S et al. Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. 2017. http://arxiv.org/abs/1706.00527. ↩
-
Bergmeir C, Hyndman RJ, Benı́tez JM. Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International journal of forecasting 2016; 32: 303–312. ↩
-
Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: Current status and future directions. 2019. http://arxiv.org/abs/1909.00590. ↩
-
Bandara K, Hewamalage H, Liu Y-H, Kang Y, Bergmeir C. Improving the accuracy of global forecasting models using time series data augmentation. 2020. http://arxiv.org/abs/2008.02663. ↩↩
-
Forestier G, Petitjean F, Dau HA, Webb GI, Keogh E. Generating synthetic time series to augment sparse datasets. In: 2017 IEEE international conference on data mining (ICDM). 2017, pp 865–870. ↩↩↩
-
Cao H, Tan VYF, Pang JZF. A parsimonious mixture of gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE transactions on neural networks and learning systems 2014; 25: 2226–2239. ↩
Ended: Synthetic Datasets
Forecasting ↵
Time Series Forecasting Tasks¶
There are many types of time series forecasting tasks, and they can be categorized by different criteria, for example, by the number of variables in the series and their relations to each other.
In the introduction of this chapter, we already discussed some of the terminology of time series forecasting. In this section, we dive into the details of univariate and multivariate time series forecasting.
Forecasting Univariate Time Series¶
In a univariate time series forecasting task, we are given a single time series and asked to forecast future steps of the series.

Given a time series \(\{y_{t}\}\), we train a model to forecast \(\color{red}y_{t+1:t+H}\) using input \(\color{blue}y_{t-K:t}\), i.e., we build a model \(f\) such that

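The input-output pairs \((y_{t-K:t}, y_{t+1:t+H})\) can be prepared with a simple sliding window (a sketch; the function name is ours, and libraries such as Darts provide this functionality out of the box):

```python
import numpy as np

def sliding_windows(y, input_length, horizon):
    """Turn a series into (input, target) pairs: model f maps y[t-K:t] to y[t+1:t+H]."""
    inputs, targets = [], []
    for t in range(input_length, len(y) - horizon + 1):
        inputs.append(y[t - input_length:t])  # the past K observations
        targets.append(y[t:t + horizon])      # the next H observations
    return np.array(inputs), np.array(targets)

y = np.arange(10.0)
X, Y = sliding_windows(y, input_length=3, horizon=2)
print(X.shape, Y.shape)  # (6, 3) (6, 2)
print(X[0], Y[0])        # [0. 1. 2.] [3. 4.]
```

Each row of `X` is a model input and the corresponding row of `Y` is its forecasting target, so a univariate forecasting problem becomes ordinary supervised learning.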
Forecasting Multivariate Time Series¶
In a multivariate time series forecasting task, we deal with multiple time series. Naively, we might expect multivariate forecasting to be nothing more than adding more series. However, complications arise because different series may not be well aligned at all time steps.
In the introduction of this chapter, we have shown the basic ideas of targets \(\mathbf y\) and covariates \(\mathbf x\) and \(\mathbf u\). In the following illustration, we expand the idea to the multivariate case.

Naive Forecasts¶
In some sense, time series forecasting is easy, if we have low expectations. From a dynamical system point of view, our future is usually not too different from our current state.
Last Observation¶
Assuming our time series is not changing dramatically, we can take our last observation as our forecast.
Example: Last Observation as Forecast
Assuming we have the simplest dynamical system,
where \(y(t)\) is the time series generator function, \(t\) is time, \(\theta\) is some parameters defining the function \(f\).
For example,
is a linear growing time series.
It would not be unreasonable to simply take the last observed value as our forecast.

import matplotlib.pyplot as plt
from darts.utils.timeseries_generation import linear_timeseries
ts = linear_timeseries(length=30)
ts.plot(marker=".")
ts_train, ts_test = ts.split_before(0.9)
ts_train.plot(marker=".", label="Train")
ts_test.plot(marker="+", label="Test")
ts_last_value_naive_forecast = ts_train.shift(1)[-1]
fig, ax = plt.subplots(figsize=(10, 6.18))
ts_train.plot(marker=".", label="Train", ax=ax)
ts_test.plot(marker="+", label="Test", ax=ax)
ts_last_value_naive_forecast.plot(marker=".", label="Last Value Naive Forecast")
There are also slightly more complicated naive forecasting methods.
Mean Forecast¶
In some bounded time series, the mean of the past values is also a good naive candidate1.
Example: Naive Mean Forecast

import matplotlib.pyplot as plt
from darts.utils.timeseries_generation import sine_timeseries
from darts.models.forecasting.baselines import NaiveMean
ts_sin = sine_timeseries(length=30, value_frequency=0.05)
ts_sin.plot(marker=".")
ts_sin_train, ts_sin_test = ts_sin.split_before(0.9)
ts_sin_train.plot(marker=".", label="Train")
ts_sin_test.plot(marker="+", label="Test")
naive_mean_model = NaiveMean()
naive_mean_model.fit(ts_sin_train)
ts_mean_naive_forecast = naive_mean_model.predict(1)
fig, ax = plt.subplots(figsize=(10, 6.18))
ts_sin_train.plot(marker=".", label="Train", ax=ax)
ts_sin_test.plot(marker="+", label="Test", ax=ax)
ts_mean_naive_forecast.plot(marker=".", label="Naive Mean Forecast")
Simple Exponential Smoothing¶
Simple Exponential Smoothing (SES) is a naive smoothing method to account for the historical values of a time series when forecasting. The expanded form of SES is1
Truncated SES is Biased
Naively speaking, if history is constant, we should forecast the same constant. For example, if we have \(y(t) = y(t_0)\), the smoothing
\[
\hat y = \alpha \sum_{n=0}^{\infty} (1-\alpha)^n y(t_0)
\]
should equal \(y(t_0)\), i.e.,
\[
\alpha \sum_{n=0}^{\infty} (1-\alpha)^n = 1.
\]
The series indeed sums up to \(1/\alpha\) when \(n\to\infty\) since
\[
\sum_{n=0}^{\infty} (1-\alpha)^n = \frac{1}{1-(1-\alpha)} = \frac{1}{\alpha}.
\]
However, if we truncate the series to finite values, we will have
\[
\alpha \sum_{n=0}^{N-1} (1-\alpha)^n = 1 - (1-\alpha)^N < 1.
\]
Then our naive forecast for the constant series is
\[
\hat y = \left( 1 - (1-\alpha)^N \right) y(t_0) < y(t_0)
\]
when \(y(t_0)\) is positive.
As an intuition, we plot out the sum of the coefficients for different orders and \(\alpha\)s.

from itertools import product
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
from matplotlib.colors import LogNorm

def ses_coefficients(alpha, order):
    # SES weights alpha * (1 - alpha)^n for n = 0, ..., order - 1
    return np.power(np.ones(int(order)) * (1 - alpha), np.arange(order)) * alpha

alphas = np.linspace(0.05, 0.95, 19)
orders = list(range(1, 16))
# Create dataframes for visualizations
df_ses_coefficients = pd.DataFrame(
    [[alpha, order] for alpha, order in product(alphas, orders)],
    columns=["alpha", "order"],
)
df_ses_coefficients["ses_coefficients_sum"] = df_ses_coefficients.apply(
    lambda x: ses_coefficients(x["alpha"], x["order"]).sum(), axis=1
)
# Visualization
g = sns.heatmap(
    data=df_ses_coefficients.pivot(
        index="alpha", columns="order", values="ses_coefficients_sum"
    ),
    square=True, norm=LogNorm(),
    fmt="0.2g",
    yticklabels=[f"{i:0.2f}" for i in alphas],
)
g.set_title("SES Sum of Coefficients");
Holt-Winters' Exponential Smoothing
In applications, the Holt-Winters' exponential smoothing is more practical123.
We create some demo time series and apply the Holt-Winters' exponential smoothing. To see where exponential smoothing works, we forecast from different dates.





import matplotlib.pyplot as plt
from darts.utils.timeseries_generation import sine_timeseries
from darts.models import ExponentialSmoothing
ts_sin = sine_timeseries(length=30, value_frequency=0.05)
ts_sin.plot(marker=".")
ts_sin_train, ts_sin_test = ts_sin.split_before(0.7)
es_model = ExponentialSmoothing()
es_model.fit(ts_sin_train)
es_model_sin_forecast = es_model.predict(4)
fig, ax = plt.subplots(figsize=(10, 6.18))
ts_sin_train.plot(marker=".", label="Train", ax=ax)
ts_sin_test.plot(marker="+", label="Test", ax=ax)
es_model_sin_forecast.plot(marker=".", label="Exponential Smoothing Forecast")
Other¶
Other naive forecasts, such as naive drift, are introduced in Hyndman, et al., (2021)1.
-
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2023-02-13. ↩↩↩↩
-
6.4.3.5. Triple Exponential Smoothing. In: NIST Engineering Statistics Handbook [Internet]. [cited 16 Feb 2023]. Available: https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm ↩
-
Example: Holt-Winters Exponential Smoothing — NumPyro documentation. In: NumPyro [Internet]. [cited 16 Feb 2023]. Available: https://num.pyro.ai/en/stable/examples/holt_winters.html ↩
Ended: Forecasting
Evaluation and Metrics ↵
Time Series Forecasting Evaluation¶
Evaluating time series forecasting models is very important yet sometimes difficult. For example, it is easy to introduce information leakage when evaluating time series forecasting models. In this section, we discuss some common pitfalls and best practices for evaluating time series forecasting models.
Train Test Split for Time Series Data¶
Evaluating time series models is usually different from most other machine learning tasks as we usually don't have strictly i.i.d. data. On the other hand, the time dimension in our time series data is a natural dimension to split the data into train and test sets. In this section, we will discuss the different ways to split the data into train and test sets.
Backtesting¶
We choose a specific time step at which to split the data. Assume the series used to train the model is \(Y_t\) with length \(T_t\), and the series used to evaluate the model is \(Y_e\) with length \(T_e\).
Slicing the Training Data \(Y_t\) for Training
In many deep learning models, the input length and output length are fixed. To train the model using time series data, we usually apply the time delayed embedding method to prepare the train data \(Y_t\). Refer to our deep learning forecasting examples for more details.
Keeping the length of the train and test unchanged, we can move forward in time, where we require the split time point to fall inside the window1. In this way, we create multiple train test splits and perform multiple evaluations, or backtesting, on the model. The following illustration shows an example of this technique. The uppermost panel shows the original series, with each block indicating a time step.

Expanding the Length of the Training Set
Keeping the train set length fixed when sliding through time simulates the use case where we always take a fixed length of historical data to train the model. For example, some datasets exhibit gradual data shift, and using a fixed length of historical data can help alleviate data shift problems.
In other use cases, we would take in as much data as possible. In this case, we can also expand the train set when sliding the window.

In some use cases, we do not get to know the most recent data when performing inference. For example, if we are forecasting the demand for the next week, we might not know the demand for the last week, as the data might not be ready yet. In this case, we can use a gap between the train and test sets to simulate this situation.

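The splitting schemes above can be sketched as an index generator (a toy illustration; the function name and signature are ours, and libraries such as Darts offer backtesting utilities):

```python
def backtest_splits(n, train_length, test_length, step=1, expanding=False, gap=0):
    """Yield (train_indices, test_indices) for rolling-origin backtesting.

    With expanding=True the train set grows instead of sliding; gap leaves
    unused steps between train and test to mimic delayed data availability.
    """
    start = 0
    while start + train_length + gap + test_length <= n:
        train_end = start + train_length
        train_start = 0 if expanding else start
        yield (
            list(range(train_start, train_end)),
            list(range(train_end + gap, train_end + gap + test_length)),
        )
        start += step

splits = list(backtest_splits(n=10, train_length=5, test_length=2, step=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [0, 1, 2, 3, 4] [5, 6]
# [2, 3, 4, 5, 6] [7, 8]
```

Averaging the evaluation metric over all such splits gives a more stable estimate of model performance than a single train test split.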
Using Multiple Time Windows¶
When we move the split time point forward in time, we could constrain the split to fall inside a specific time window. In the following, we have assumed the sliding window size to be 4, where we move forward one time step for each test set.

For some large time series forecasting datasets, we might be interested in the performance of some specific types of periods. For example, if Amazon is evaluating its demand forecasting model, it may be more interested in the performance of the model during some normal days as well as the holiday seasons. In this case, we can use multiple sliding windows to evaluate the model.
Cross-validation¶
Not a Common Practice
This is not a common practice in time series forecasting problems but it is still worth mentioning here.
If our dataset is i.i.d. through time, so that randomly taking a subsequence introduces no information leakage, we can also use cross-validation to evaluate the model. The following illustration shows an example of this technique. The uppermost panel shows the original series, with each block indicating a time step.

Similar to the gap technique in backtesting, we can also use a gap between the train and test sets.

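A minimal sketch of such blocked splits with a gap (the helper name is ours; scikit-learn's `TimeSeriesSplit` provides a related, ordered alternative):

```python
def blocked_cv_splits(n, n_folds, gap=0):
    """Split indices into contiguous folds; each fold is the test set in turn.

    A gap removes indices adjacent to the test block from the train set,
    reducing leakage from autocorrelation.
    """
    fold_size = n // n_folds
    for k in range(n_folds):
        test_start, test_end = k * fold_size, (k + 1) * fold_size
        test_idx = list(range(test_start, test_end))
        train_idx = [
            i for i in range(n)
            if i < test_start - gap or i >= test_end + gap
        ]
        yield train_idx, test_idx

for train_idx, test_idx in blocked_cv_splits(n=12, n_folds=3, gap=1):
    print(train_idx, test_idx)
```

Unlike backtesting, some folds here train on data that lies after the test block, which is only valid under the i.i.d. assumption stated above.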
-
Cerqueira V, Torgo L, Mozetic I. Evaluating time series forecasting models: An empirical study on performance estimation methods. 2019. http://arxiv.org/abs/1905.11744. ↩
Time Series Forecasting Metrics¶
Measuring the goodness of a forecaster is nontrivial. Many metrics have been devised to measure forecasting results, and applying the wrong metric may lead to costly consequences in decisions.
In the following discussion, we assume the forecast at time \(t\) to be \(\hat y(t)\) and the actual value is \(y(t)\). The forecast horizon is defined as \(H\). In general, we look for a function
where \(\{C(t)\}\) are the covariates and \(\{y(t)\}\) represents the past target variables.
Distance between Truths and Forecasts
Naive choices of such metrics are distances between the truth vector \(\{y(t_1), \cdots, y(t_H)\}\) and the forecast vector \(\{\hat y(t_1), \cdots, \hat y(t_H)\}\).
For example, we can use norms of the deviation vector \(\{y(t_1) - \hat y(t_1), \cdots, y(t_H) - \hat y(t_H)\}\).
In Hyndman & Koehler (2006), \(y(t_i) - \hat y(t_i)\) is defined as the forecast error, \(e_i\equiv y(t_i) - \hat y(t_i)\)4. Though the name can be confusing, the term forecast error is widely used in the literature.
The authors also defined the relative error \(r_i = e_i/e^*_i\) with \(e^*_i\) being the reference forecast error from the baseline.
In this section, we explore some frequently used metrics. Hyndman & Koehler (2006) discussed several types of metrics4
- scale-dependent measures, e.g., errors based on \(\{y(t_1) - \hat y(t_1), \cdots, y(t_H) - \hat y(t_H)\}\),
- percentage errors, e.g., errors based on \(\{\frac{y(t_1) - \hat y(t_1)}{y(t_1)}, \cdots, \frac{y(t_H) - \hat y(t_H)}{y(t_H)}\}\),
- relative errors, e.g., errors based on \(\{\frac{y(t_1) - \hat y(t_1)}{y(t_1) - \hat y^*(t_1)}, \cdots, \frac{y(t_H) - \hat y(t_H)}{y(t_H) - \hat y^*(t_H)}\}\), where \(\hat y^*(t_i)\) is a baseline forecast at time \(t_i\),
- relative metrics, e.g., the ratio of the MAE for the experiment and the baseline, \(\operatorname{MAE}/\operatorname{MAE}_{\text{baseline}}\),
- in-sample scaled errors, e.g., MASE.
Apart from the above categories, there are some other properties of metrics. Some metrics are bounded while others are not. Also, some metrics specifically require probabilistic forecasts. In the following table, we list some of the useful metrics.
| Metric | Probabilistic | Theoretical Range | Notes |
|---|---|---|---|
| MAE | | \([0,\infty)\) | |
| MSE | | \([0,\infty)\) | |
| RMSE | | \([0,\infty)\) | |
| MASE | | \([0,\infty)\) | Scaled in practice; requires in-sample data |
| RMSLE | | \([0,\infty)\) | |
| MAPE | | \([0,\infty]\) | |
| sMAPE | | \([0, 2]\) | For values of the same sign |
| wMAPE | | - | Depends on what weights are used |
| Quantile Score | Yes | \([0,\infty)\) | |
| CRPS | Yes | \([0,\infty)\) | |
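For concreteness, here is a sketch of a few of these metrics in plain NumPy, using the conventions of this section (library implementations, e.g., in Darts or GluonTS, handle series alignment and edge cases):

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def smape(y, y_hat):
    # symmetric MAPE, bounded by 2 for values of the same sign
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mase(y, y_hat, y_insample):
    # MAE scaled by the in-sample one-step naive forecast error
    scale = np.mean(np.abs(np.diff(y_insample)))
    return mae(y, y_hat) / scale

y_true, y_pred = np.array([100.0, 1.0]), np.array([0.0, 0.0])
print(mae(y_true, y_pred))              # 50.5
print(round(float(rmse(y_true, y_pred)), 3))  # 70.714
```

Note the convention matters: some libraries report MAPE and sMAPE as percentages (multiplied by 100), whereas this section uses plain ratios.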
Recommended Reading
Hyndman & Athanasopoulos (2021) is a good reference for forecast errors 1.
To find implementations of metrics, Darts and GluonTS both have a handful of metrics implemented.
List of Metrics¶
Code to Reproduce the Results
# %%
from loguru import logger
import datetime
import numpy as np
from itertools import product
from matplotlib.ticker import FormatStrFormatter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from darts.utils.timeseries_generation import (
sine_timeseries,
linear_timeseries,
constant_timeseries,
)
from darts.metrics.metrics import (
mae,
mape,
marre,
mse,
ope,
rho_risk,
rmse,
rmsle,
smape,
mase
)
# %%
length = 600
start = 0
ts_sin = sine_timeseries(length=length, value_frequency=0.01, start=0)
ts_lin = linear_timeseries(
length=length, start_value=0, end_value=1.5, start=0
)
ts = (ts_sin + ts_lin).with_columns_renamed("sine", "sin+linear")
split_at = 500
ts_train, ts_test = ts.split_before(split_at)
ts_train = ts_train.with_columns_renamed("sin+linear", "train")
ts_test = ts_test.with_columns_renamed("sin+linear", "actual")
_, ts_pred_lin = ts_lin.split_before(split_at)
ts_pred_lin = ts_pred_lin.with_columns_renamed("linear", "linear_prediction")
_, ts_pred_sin = ts_sin.split_before(split_at)
ts_pred_sin = ts_pred_sin.with_columns_renamed("sine", "sin_prediction")
ts_pred_const = constant_timeseries(
value=ts_train.last_value(),
start=ts_test.start_time(),
end=ts_test.end_time()
)
ts_pred_const = ts_pred_const.with_columns_renamed("constant", "constant_prediction")
# %%
ts.plot(marker=".")
# ts_lin.plot(linestyle="dashed")
# ts_sin.plot(linestyle="dashed")
ts_train.plot()
ts_test.plot(color="r")
ts_pred_lin.plot(color="orange")
ts_pred_sin.plot(color="green")
ts_pred_const.plot(color="black")
# %%
class MetricBench:
def __init__(self, metric_fn, metric_name=None):
self.metric_fn = metric_fn
if metric_name is None:
metric_name = self.metric_fn.__name__
self.metric_name = metric_name
def _heatmap_data(self, actual_range=None, pred_range=None):
if actual_range is None:
actual_range = np.linspace(-1,1, 21)
if pred_range is None:
pred_range = np.linspace(-1,1, 21)
hm_data = []
for y, y_hat in product(actual_range, pred_range):
ts_y = constant_timeseries(value=y, length=1)
ts_y_hat = constant_timeseries(value=y_hat, length=1)
try:
hm_data.append(
{
"y": y,
"y_hat": y_hat,
"metric": f"{self.metric_name}",
"value": self.metric_fn(ts_y, ts_y_hat)
}
)
except Exception as e:
logger.warning(f"Skipping due to {e}")
df_hm_data = pd.DataFrame(hm_data)
return df_hm_data
def heatmap(self, ax=None, cmap="viridis_r", actual_range=None, pred_range=None):
if ax is None:
fig, ax = plt.subplots(figsize=(12, 10))
df_hm_data = self._heatmap_data(actual_range=actual_range, pred_range=pred_range)
sns.heatmap(
df_hm_data.pivot(index="y_hat", columns="y", values="value"),
fmt=".2g",
cmap=cmap,
ax=ax,
)
ax.set_xticklabels(
[self._heatmap_fmt(label.get_text())
for label in ax.get_xticklabels()]
)
ax.set_yticklabels(
[self._heatmap_fmt(label.get_text())
for label in ax.get_yticklabels()]
)
ax.set_title(f"Metric: {self.metric_name}")
return ax
@staticmethod
def _heatmap_fmt(s):
try:
n = "{:.2f}".format(float(s))
except (TypeError, ValueError):
n = ""
return n
def _ratio_data(self, pred_range=None):
if pred_range is None:
pred_range = np.linspace(-1, 3, 41)
ratio_data = []
y = 1
for y_hat in pred_range:
ts_y = constant_timeseries(value=y, length=1)
ts_y_hat = constant_timeseries(value=y_hat, length=1)
try:
ratio_data.append(
{
"y": y,
"y_hat": y_hat,
"metric": f"{self.metric_name}",
"value": self.metric_fn(ts_y, ts_y_hat)
}
)
except Exception as e:
logger.warning(f"Skipping due to {e}")
df_ratio_data = pd.DataFrame(ratio_data)
return df_ratio_data
def ratio_plot(self, ax=None, color="k", pred_range=None):
if ax is None:
fig, ax = plt.subplots(figsize=(12, 10))
df_ratio_data = self._ratio_data(pred_range=pred_range)
sns.lineplot(df_ratio_data, x="y_hat", y="value", ax=ax)
ax.set_title(f"Metric {self.metric_name} (y=1)")
return ax
# %% [markdown]
# ## Norms (MAE, MSE)
# %%
mse_bench = MetricBench(metric_fn=mse)
mse_bench.heatmap()
# %%
mse_bench.ratio_plot()
# %%
mae_bench = MetricBench(metric_fn=mae)
mae_bench.heatmap()
# %%
mae_bench.ratio_plot()
# %%
rmsle_bench = MetricBench(metric_fn=rmsle)
rmsle_bench.heatmap()
# %%
rmsle_bench.ratio_plot()
# %%
rmse_bench = MetricBench(metric_fn=rmse)
rmse_bench.heatmap()
# %%
rmse_bench.ratio_plot()
# %%
y_pos = np.linspace(0.1, 1, 20)
mape_bench = MetricBench(metric_fn=mape)
mape_bench.heatmap(actual_range=y_pos)
# %%
mape_bench.ratio_plot()
# %%
smape_bench = MetricBench(metric_fn=smape)
smape_bench.heatmap(actual_range=y_pos, pred_range=y_pos)
# %%
smape_bench.ratio_plot(pred_range=y_pos)
# %% [markdown]
# ## Naive MultiHorizon Forecasts
# %%
two_args_metrics = [
mse, mae, rmse, rmsle, mape, smape
]
insample_metrics = [mase]
metrics_tests = []
for m in two_args_metrics:
metrics_tests.append(
{
"metric": m.__name__,
"value_lin_pred": m(ts_test, ts_pred_lin),
"value_sin_pred": m(ts_test, ts_pred_sin),
"value_const_pred": m(ts_test, ts_pred_const)
}
)
for m in insample_metrics:
metrics_tests.append(
{
"metric": m.__name__,
"value_lin_pred": m(ts_test, ts_pred_lin, insample=ts_train),
"value_sin_pred": m(ts_test, ts_pred_sin, insample=ts_train),
"value_const_pred": m(ts_test, ts_pred_const, insample=ts_train)
}
)
df_metrics_tests = (
pd.DataFrame(metrics_tests)
.round(3)
.set_index("metric")
.sort_values(by="value_const_pred")
.sort_values(by="mape", axis=1)
)
df_metrics_tests.rename(
columns={
"value_lin_pred": "Linear Prediction",
"value_sin_pred": "Sine Prediction",
"value_const_pred": "Last Observed"
},
inplace=True
)
df_metrics_tests
# %%
from matplotlib.colors import LogNorm
# %%
metrics_tests_min_value = df_metrics_tests.min().values.min()
metrics_tests_max_value = np.ma.masked_invalid(df_metrics_tests.max()).max()
metrics_tests_min_value, metrics_tests_max_value
# %%
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.heatmap(
df_metrics_tests,
norm=LogNorm(
vmin=0.1,
vmax=100
),
cbar_kws={"ticks":[0,1,10,1e2]},
vmin = 0.1, vmax=100,
annot=True,
fmt="0.3f",
ax=ax,
)
1-Norm: MAE¶
The Mean Absolute Error (MAE) is
\[
\operatorname{MAE} = \frac{1}{H} \sum_{i=1}^{H} \left\lvert y(t_i) - \hat y(t_i) \right\rvert.
\]
We can check some special and extreme cases:
-
All forecasts are zeros: \(\hat y(t)=0\), the value of \(\operatorname{MAE}(y, \hat y)\) is determined by the true value \(y\).
The Interpretation of MAE is Scale Dependent
This also tells us that the interpretation of MAE depends on the scale of the true values: an MAE of \(100\) for large true values (e.g., \(y=1000\) with forecast \(\hat y=900\)) doesn't seem bad, but the same MAE for smaller true values (e.g., \(y=100\) with forecast \(\hat y=0\)) is quite off. Of course, the actual perception depends on the problem we are solving.
This brings in a lot of trouble when we are dealing with forecasts on different scales, such as sales forecasts for all kinds of items on an e-commerce platform. Different types of items, e.g., expensive watches vs cheap T-shirts, have very different sales. In fact, a paper from Amazon shows that the sales on Amazon are even scale-free5.
-
All forecasts are infinite: \(\hat y=\infty\), the MAE value will also be \(\infty\). This means MAE is not bounded.


2-Norm: MSE¶
The Mean Square Error (MSE) is
\[
\operatorname{MSE} = \frac{1}{H} \sum_{i=1}^{H} \left( y(t_i) - \hat y(t_i) \right)^2.
\]
Similar to MAE, the interpretation of MSE is also scale dependent and the value is unbounded. However, due to the square, MSE values can be extremely large or small. Obtaining insights from MSE is even harder than from MAE in most situations, unless MSE matches a meaningful quantity in the dynamical system we are forecasting. Nevertheless, we know that large deviations (\(\lvert y(t) - \hat y(t)\rvert \gg 1\)) dominate the metric even more than in MAE.


Other Norms
Other norms are not usually seen in literature but might provide insights into forecasts.
The Max Norm error of a forecast can be defined as2
RMSE¶
The Root Mean Square Error (RMSE) is
\[
\operatorname{RMSE} = \sqrt{ \frac{1}{H} \sum_{i=1}^{H} \left( y(t_i) - \hat y(t_i) \right)^2 }.
\]
RMSE essentially brings the scale of the metric from the MSE scale back to something similar to MAE. However, we have to be mindful that large deviations dominate the metric more than that in MAE.
Domination by Large Deviations
For example, in a horizon 2 forecasting problem, suppose we have the true values \([100, 1]\) and we forecast \([0, 0]\),
\[
\operatorname{RMSE} = \sqrt{\frac{100^2 + 1^2}{2}} \approx 70.714.
\]
If we assume the second step is forecasted perfectly, i.e., forecasts \([0, 1]\), we have almost the same RMSE,
\[
\operatorname{RMSE} = \sqrt{\frac{100^2 + 0^2}{2}} \approx 70.711.
\]
For MAE, assuming forecasts \([0,0]\), we get
\[
\operatorname{MAE} = \frac{100 + 1}{2} = 50.5.
\]
If we assume the forecast \([0,1]\), we get something slightly different,
\[
\operatorname{MAE} = \frac{100 + 0}{2} = 50.
\]
To see the difference between RMSE and MAE visually, for true values \([x, 1]\) we compute the ratio
\[
\frac{\operatorname{MAE}\left([x, 1], [0, 1]\right)}{\operatorname{MAE}\left([x, 1], [0, 0]\right)} = \frac{x}{x + 1},
\]
as well as
\[
\frac{\operatorname{RMSE}\left([x, 1], [0, 1]\right)}{\operatorname{RMSE}\left([x, 1], [0, 0]\right)} = \sqrt{\frac{x^2}{x^2 + 1}}.
\]
Using these ratios, we investigate the contributions from the large deviations for MAE and RMSE.

import numpy as np
from darts.metrics.metrics import mae, rmse
from darts import TimeSeries
metric_contrib_x = np.linspace(0, 50, 101)
mae_contrib_ratio = []
for i in metric_contrib_x:
mae_contrib_ratio.append(
mae(
TimeSeries.from_values(np.array([i,1,])),
TimeSeries.from_values(np.array([0,1,])),
)/mae(
TimeSeries.from_values(np.array([i,1,])),
TimeSeries.from_values(np.array([0,0,])),
)
)
rmse_contrib_ratio = []
for i in metric_contrib_x:
rmse_contrib_ratio.append(
rmse(
TimeSeries.from_values(np.array([ i, 1,])),
TimeSeries.from_values(np.array([0, 1,])),
)/rmse(
TimeSeries.from_values(np.array([i, 1,])),
TimeSeries.from_values(np.array([0,0])),
)
)
The above chart shows that RMSE is more dominated by large deviations.
MASE¶
The Mean Absolute Scaled Error (MASE) is the MAE scaled by the one-step ahead naive forecast error on the training data (\(\{y(t_i)\}\), with \(i\in \{1, \cdots, T\}\))3,
\[
\operatorname{MASE} = \frac{ \frac{1}{H} \sum_{i=1}^{H} \lvert y(t_i) - \hat y(t_i) \rvert }{ \frac{1}{T-1} \sum_{i=2}^{T} \lvert y(t_i) - y(t_{i-1}) \rvert }.
\]
Due to the scaling by the one-step ahead naive forecast, MASE is easier to interpret. If MASE is large, the deviation in our forecasts is comparable to the rough scale of the time series. Naively, we expect a good MASE to be smaller than 1.
RMSLE¶
The Root Mean Squared Log Error (RMSLE) is
\[
\operatorname{RMSLE} = \sqrt{ \frac{1}{H} \sum_{i=1}^{H} \left( \ln\left(1 + y(t_i)\right) - \ln\left(1 + \hat y(t_i)\right) \right)^2 }.
\]


MAPE¶
The Mean Absolute Percent Error (MAPE) is defined as
\[
\operatorname{MAPE} = \frac{1}{H} \sum_{i=1}^{H} \left\lvert \frac{y(t_i) - \hat y(t_i)}{y(t_i)} \right\rvert.
\]
It blows up when the true values are close to zero, so it is unbounded from above.


sMAPE¶
The symmetric Mean Absolute Percent Error (sMAPE) is a symmetrized version of MAPE,
\[
\operatorname{sMAPE} = \frac{1}{H} \sum_{i=1}^{H} \frac{ 2 \lvert y(t_i) - \hat y(t_i) \rvert }{ \lvert y(t_i) \rvert + \lvert \hat y(t_i) \rvert }.
\]


sMAPE is Bounded but Hard to Get a Feeling
Even though sMAPE is bounded and it solves the blow-up problem of MAPE, it is dangerous to use sMAPE alone. For example, given the true values \([1]\), forecasting \([10]\) gives an sMAPE value of \(1.636\); forecasting \([100]\) gives \(1.960\); forecasting \([1000]\) gives \(1.996\). The latter forecasts are wildly different, yet their sMAPE values are close.
That is, once the sMAPE value gets a bit large, it is hard to form a stable intuition of how bad the forecast is.
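We can reproduce the numbers in the note above directly (a quick check using the sMAPE convention of this section):

```python
import numpy as np

def smape(y, y_hat):
    # symmetric MAPE: 2|y - y_hat| / (|y| + |y_hat|), averaged over the horizon
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

for forecast in [10.0, 100.0, 1000.0]:
    value = smape(np.array([1.0]), np.array([forecast]))
    print(round(float(value), 3))  # 1.636, then 1.96, then 1.996
```

The metric saturates toward its bound of 2, so it barely distinguishes a forecast that is 10x off from one that is 1000x off.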
wMAPE¶
The weighted Mean Absolute Percent Error (wMAPE) is
\[
\operatorname{wMAPE} = \frac{ \sum_{i=1}^{H} w(t_i) \lvert y(t_i) - \hat y(t_i) \rvert }{ \sum_{i=1}^{H} w(t_i) \lvert y(t_i) \rvert },
\]
where the choice of the weights \(w(t_i)\) depends on the application; taking \(w(t_i) = 1\) reduces it to the total absolute error divided by the total absolute actuals.
Quantile Loss¶
The quantile loss is defined as 678
\[
\operatorname{QL}_q\left( y(t_i), \hat y(t_i) \right) = q \left( y(t_i) - \hat y(t_i) \right)_{+} + (1 - q) \left( \hat y(t_i) - y(t_i) \right)_{+},
\]
where \({}_{+}\) indicates that we only take the positive part, i.e., \((x)_{+} = \max(x, 0)\).
Quantile Loss has many names
The quantile loss is also called quantile score, pinball loss, quantile risk or \(\rho\)-risk.
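A minimal sketch of the quantile (pinball) loss under the definition above (the function name is ours):

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Pinball loss: q * (y - y_hat)_+ + (1 - q) * (y_hat - y)_+, averaged."""
    diff = y - y_hat
    # max(q * diff, (q - 1) * diff) equals the two-term definition pointwise
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# With q = 0.1, over-forecasting is penalized much more than under-forecasting
y = np.array([10.0])
print(quantile_loss(y, np.array([12.0]), q=0.1))  # 1.8
print(quantile_loss(y, np.array([8.0]), q=0.1))   # 0.2
```

The expected pinball loss is minimized by the \(q\)-quantile of the predictive distribution, which is why it is the natural score for quantile forecasts.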
Other Metrics¶
We do not cover every available metric here, but we explain some of the more complicated metrics, e.g., CRPS, in dedicated sections.
Metrics Applied on a Toy Problem¶
To feel the difference between each metric, we assume a simple forecasting problem with some artificial time series data.
We construct the artificial data by summing a sine series and a linear series.

We have prepared three naive forecasts,
- forecasting constant values using the last observation,
- forecasting the sine component of the actual data,
- forecasting the linear component of the actual data.
We calculated the metrics for the three different scenarios.

-
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2022-11-27. ↩
-
Contributors to Wikimedia projects. Uniform norm. In: Wikipedia [Internet]. 23 Oct 2022 [cited 5 Mar 2023]. Available: https://en.wikipedia.org/wiki/Uniform_norm ↩
-
Contributors to Wikimedia projects. Mean absolute scaled error. In: Wikipedia [Internet]. 11 Jan 2023 [cited 5 Mar 2023]. Available: https://en.wikipedia.org/wiki/Mean_absolute_scaled_error ↩
-
Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International journal of forecasting 2006; 22: 679–688. ↩↩
-
Salinas D, Flunkert V, Gasthaus J. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. 2017. http://arxiv.org/abs/1704.04110. ↩
-
Gneiting T. Quantiles as optimal point forecasts. International journal of forecasting 2011; 27: 197–207. ↩
-
Koenker R, Bassett G. Regression quantiles. Econometrica: journal of the Econometric Society 1978; 46: 33–50. ↩
-
Vargas Staudacher JMR de, Wu B, Struss C, Mettenleiter N. Uncertainty quantification and probabilistic forecasting of big data time series at amazon supply chain. TUM Data Innovation Lab, 2022https://www.mdsi.tum.de/fileadmin/w00cet/di-lab/pdf/Amazon\SS2022\Final\Report.pdf. ↩
Continuous Ranked Probability Score (CRPS)¶
The Continuous Ranked Probability Score, aka CRPS, is a score that measures how well a proposed distribution approximates the data, without knowledge of the true distribution of the data.
Definition¶
CRPS is defined as1
\[
\operatorname{CRPS}(P, x_a) = \lVert P(x) - H(x - x_a) \rVert_2^2 = \int_{-\infty}^{\infty} \left( P(x) - H(x - x_a) \right)^2 \mathrm{d} x,
\]
where
- \(x_a\) is the true value of \(x\),
- \(P(x)\) is our proposed cumulative distribution for \(x\),
- \(H(x)\) is the Heaviside step function,
- \(\lVert \cdot \rVert_2\) is the L2 norm.
Heaviside Step Function
The Heaviside step function is defined as \(H(x) = 0\) for \(x < 0\) and \(H(x) = 1\) for \(x \geq 0\).
Explain it¶
The formula looks abstract at first sight, but it becomes clear once we unpack it.
Note that the distribution that corresponds to a Heaviside CDF is the delta function \(\delta(x-x_a)\). What this score is calculating is the difference between our distribution and a delta function. If we have a model that minimizes CRPS, then we are looking for a distribution that is close to the delta function \(\delta(x-x_a)\). In other words, we want our distribution to be large around \(x_a\).
To illustrate what the integrand \(\left( P(x) - H(x - x_a) \right)^2\) means, we apply some shades to the integrand of the integral in CRPS. We visualize four different scenarios.
Scenario 1: The predicted CDF \(P(x)\) is reaching 1 very fast.

Scenario 2: The predicted CDF \(P(x)\) is reaching 1 quite late.

Scenario 3: The predicted CDF \(P(x)\) is reaching 1 around the Heaviside function jump.

Scenario 4: The predicted CDF \(P(x)\) is steadily increasing but very dispersed.

With the four scenarios visualized, intuitively, the only way to get a small CRPS is to choose a distribution that is concentrated around \(x_a\). Echoing the earlier note on the delta function being the density of the Heaviside function, a small CRPS reflects the following scenario: the predicted distribution \(\rho(x)\) is tightly focused around the observation \(x_a\).

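As a numerical sanity check (the helper names are ours), we can integrate the squared difference between a standard normal CDF and the Heaviside function, and compare it with the known closed-form CRPS for a Gaussian predictive distribution:

```python
import math
import numpy as np

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crps_numeric(cdf, x_a, lo=-20.0, hi=20.0, n=40_001):
    """Approximate the integral of (P(x) - H(x - x_a))^2 on a fine grid."""
    x = np.linspace(lo, hi, n)
    p = np.array([cdf(v) for v in x])
    h = (x >= x_a).astype(float)  # Heaviside step at x_a
    dx = x[1] - x[0]
    return float(((p - h) ** 2).sum() * dx)

def crps_normal_closed_form(mu, sigma, x_a):
    # known closed form for a Gaussian predictive distribution N(mu, sigma^2)
    z = (x_a - mu) / sigma
    pdf = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

numeric = crps_numeric(normal_cdf, x_a=0.0)
closed = crps_normal_closed_form(0.0, 1.0, 0.0)
print(round(numeric, 4), round(closed, 4))  # both approximately 0.2337
```

Widening the predictive distribution (larger \(\sigma\)) or shifting it away from \(x_a\) increases the score, matching the intuition from the shaded scenarios.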
Discussions¶
Gebetsberger et al. found that minimum CRPS estimation is more robust than maximum likelihood estimation, while producing similar results when the assumed data distribution is appropriate3.
CRPS is also very useful in time series forecasting. For example, the integrand of CRPS can be used as the loss function in model training 2.
-
Hersbach H. Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems. Weather Forecast. 2000;15: 559–570. doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2 ↩
-
Gouttes A, Rasul K, Koren M, Stephan J, Naghibi T. Probabilistic Time Series Forecasting with Implicit Quantile Networks. arXiv [cs.LG]. 2021. doi:10.1109/icdmw.2017.19 ↩
-
Gebetsberger M, Messner JW, Mayr GJ, Zeileis A. Estimation Methods for Nonhomogeneous Regression Models: Minimum Continuous Ranked Probability Score versus Maximum Likelihood. Mon Weather Rev. 2018;146: 4323–4338. doi:10.1175/MWR-D-17-0364.1 ↩
Ended: Evaluation and Metrics
Hierarchical Time Series ↵
Hierarchical Time Series Data¶
Many real-world time series datasets exhibit some internal structure among the series. For example, the dataset used in the M5 competition contains the sales of different items, together with the store and category information1. For simplicity, we reduce the dataset to only include the hierarchy of stores.
The simplified dataset can be found here. The original data can be found on the website of IIF. In this simplified version of the M5 dataset, we have the following hierarchy.
flowchart LR
top["Total Sales"]
ca["Sales in California"]
tx["Sales in Texas"]
wi["Sales in Wisconsin"]
top --- ca
top --- tx
top --- wi
subgraph California
ca1["Sales in Store #1 in CA"]
ca2["Sales in Store #2 in CA"]
ca3["Sales in Store #3 in CA"]
ca4["Sales in Store #4 in CA"]
ca --- ca1
ca --- ca2
ca --- ca3
ca --- ca4
end
subgraph Texas
tx1["Sales in Store #1 in TX"]
tx2["Sales in Store #2 in TX"]
tx3["Sales in Store #3 in TX"]
tx --- tx1
tx --- tx2
tx --- tx3
end
subgraph Wisconsin
wi1["Sales in Store #1 in WI"]
wi2["Sales in Store #2 in WI"]
wi3["Sales in Store #3 in WI"]
wi --- wi1
wi --- wi2
wi --- wi3
end
The above tree is useful when thinking about the hierarchies. For example, it explicitly tells us that the sales in stores #1, #2, #3 in TX should sum up to the sales in TX.
We plotted the sales in CA as well as the individual stores in CA. We can already observe some synchronized anomalies.


Summing Matrix¶
The relations between the series are represented using a summing matrix \(\mathbf S\), which connects the bottom-level series \(\mathbf b\) to the series on all levels \(\mathbf y\)2
If our forecasts satisfy this relation, we claim our forecasts to be coherent2.
Summing Matrix Example
We take part of the above dataset and only consider the hierarchy of states,
The hierarchy is also revealed in the following tree.
flowchart TD
top["Total Sales"]
ca["Sales in California"]
tx["Sales in Texas"]
wi["Sales in Wisconsin"]
top --- ca
top --- tx
top --- wi
In this example, the bottom level series are denoted as
and all the possible levels are denoted as
The summing matrix is
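In code, assuming the ordering [Total, CA, TX, WI] for the full set of series, the summing relation can be sketched as:

```python
import numpy as np

# Bottom-level series: state-level sales in CA, TX, WI (values from the table above)
b = np.array([14195, 9438, 8998])

# Summing matrix mapping the bottom level to [Total, CA, TX, WI]
S = np.array([
    [1, 1, 1],  # Total = CA + TX + WI
    [1, 0, 0],  # CA
    [0, 1, 0],  # TX
    [0, 0, 1],  # WI
])

y = S @ b  # series on all levels: [32631, 14195, 9438, 8998]
```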
-
Makridakis S, Spiliotis E, Assimakopoulos V. The M5 competition: Background, organization, and implementation. Int J Forecast. 2022;38: 1325–1336. doi:10.1016/j.ijforecast.2021.07.007 ↩
-
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2022-11-27. ↩↩
Hierarchical Time Series Reconciliation¶
Reconciliation is a post-processing method that adjusts the forecasts to be coherent. Given the base forecasts \(\hat{\mathbf y}(t)\) (forecasts for all levels, with each level forecasted independently), we use a matrix \(\mathbf P\) to map them to the bottom-level forecasts
\(P\) and \(S\)
In the previous section, we discussed the summing matrix \(\color{blue}S\). The summing matrix maps the bottom-level forecasts \(\color{red}{\mathbf b}(t)\) to all forecasts on all levels \(\color{green}\mathbf y(t)\). The example we provided was
If we forecast different levels independently, the forecasts we get
are not necessarily coherent. However, if we can choose a proper \(\mathbf P\), we can convert the base forecasts into some bottom-level forecasts
From the way they are used, \(\mathbf S\) and \(\mathbf P\) are like conjugates of each other. We have the following relation
It is clear that \(\mathbf P \mathbf S\) is identity if we set
However, this is not the only \(\mathbf P\) we can choose.
To generate the coherent forecasts \(\tilde{\mathbf y}(t)\), we use the summing matrix to map the bottom-level forecasts to coherent forecasts on all levels12
Walmart Sales in Stores
We reuse the example of the Walmart sales data. The base forecasts for all levels are
The simplest mapping to the bottom-level forecasts is
where
are the bottom-level forecasts to be transformed into coherent forecasts.
In this simple method, our mapping matrix \(\mathbf P\) can be
Using this \(\mathbf P\), we get
The last step is to apply the summing matrix
so that
In summary, our coherent forecasts for each level are
The \(\mathbf P\) we used in this example represents the bottom-up method.
Results like \(\tilde s_\mathrm{CA}(t) = \hat s_\mathrm{CA}(t)\) look reassuring, but they are not guaranteed. In other reconciliation methods, these relations might be broken, i.e., \(\tilde s_\mathrm{CA}(t) = \hat s_\mathrm{CA}(t)\) may no longer hold.
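The bottom-up \(\mathbf P\) can be checked numerically. Below is a small sketch with hypothetical base forecast values (the numbers are ours, chosen to be incoherent on purpose):

```python
import numpy as np

# Summing matrix for [Total, CA, TX, WI] over the bottom level [CA, TX, WI]
S = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Bottom-up P: discard the total, keep the bottom-level base forecasts
P = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

# Incoherent base forecasts [Total, CA, TX, WI]: 30000 != 14000 + 9500 + 9000
y_hat = np.array([30000.0, 14000.0, 9500.0, 9000.0])
y_tilde = S @ P @ y_hat  # coherent: the total becomes 32500
```

Note that \(\mathbf P \mathbf S = \mathbf I\) holds for this choice, as required.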
Component Form
We rewrite
using the component form
There is more than one \(\mathbf P\) that can map the forecasts to the bottom-level forecasts. Three of the so-called single-level approaches1 are bottom-up, top-down, and middle-out2.
Apart from these intuitive methods, Wickramasuriya et al. proposed the MinT method to find the optimal \(\mathbf P\) matrix that gives us the minimal trace of the covariance of the reconciled forecast error3,
with \(\mathbf y\) being the ground truth and \(\tilde{\mathbf y}\) being the coherent forecasts. Wickramasuriya et al. showed that the optimal \(\mathbf P\) is
where \(\mathbf W_{h} = \mathbb E\left[ \hat{\boldsymbol \epsilon} \hat{\boldsymbol \epsilon}^T \right] = \mathbb E \left[ (\mathbf y(t) - \hat{\mathbf y}(t))(\mathbf y(t) - \hat{\mathbf y}(t))^T \right]\) is the covariance matrix of the base forecast error.
\(\hat{\mathbf P} \neq \mathbf I\)
Note that \(\mathbf S\) is not a square matrix and we can't simply apply the inverse on each element,
MinT is easy to calculate, but it assumes that the base forecasts are unbiased. To relax this constraint, Van Erven & Cugliari proposed a game-theoretic method called GTOP4. In deep learning, Rangapuram et al. (2021) developed an end-to-end model for coherent probabilistic hierarchical forecasts2. For these advanced topics, we refer the readers to the original papers.
MinT Examples¶
Theories¶
To see how the MinT method works, we calculate a few examples based on equation \(\eqref{eq-mint-p}\). For simplicity, we assume that the variance \(\mathbf W\) is diagonal3. Note that the matrix \(\mathbf S \mathbf P\) decides how each original forecast is combined, \(\tilde{\mathbf y} = \mathbf S \mathbf P \hat{\mathbf y}\). It will be the key for us to understand how MinT works.
In the following examples, we observe that the lower the variance \(W_{ii}\), the less the reconciliation changes the corresponding forecast. Since a lower value of \(W_{ii}\) indicates a more reliable base forecast, reconciliation should not adjust that forecast by much.
For a 2-level hierarchical forecasting problem, the shape of the \(\mathbf S\) matrix is (3,2) and we have three values to pre-compute or assume, i.e., the diagonal elements of \(\mathbf W\).
| \(\mathbf S\) | \(\mathbf P\) | \(\mathbf S \mathbf P\) |
|---|---|---|
| \(\left[\begin{matrix}1 & 1\\1 & 0\\0 & 1\end{matrix}\right]\) | \(\left[\begin{matrix}\frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{1}} & \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{2} \left(W_{1} + W_{2} + W_{3}\right)} & - \frac{W_{2}}{W_{1} + W_{2} + W_{3}}\\\frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{1}} & - \frac{W_{3}}{W_{1} + W_{2} + W_{3}} & \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{3} \left(W_{1} + W_{2} + W_{3}\right)}\end{matrix}\right]\) | \(\left[\begin{matrix}\frac{- \frac{2 W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{1}} & \frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{2}} & \frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{3}}\\\frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{1}} & \frac{W_{1} W_{2} + W_{2} W_{3}}{W_{2} \left(W_{1} + W_{2} + W_{3}\right)} & - \frac{W_{2}}{W_{1} + W_{2} + W_{3}}\\\frac{- \frac{W_{2} W_{3}}{W_{1} + W_{2} + W_{3}} + \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{1} + W_{2} + W_{3}}}{W_{1}} & - \frac{W_{3}}{W_{1} + W_{2} + W_{3}} & \frac{W_{1} W_{3} + W_{2} W_{3}}{W_{3} \left(W_{1} + W_{2} + W_{3}\right)}\end{matrix}\right]\) |
We visualize the matrix \(\mathbf S\mathbf P\). It is straightforward to verify that it always leads to coherent results.

```python
import numpy as np
import seaborn as sns
import sympy as sp
import matplotlib.pyplot as plt  # needed for the plotting below


class MinTMatrices:
    def __init__(self, levels: int):
        self.levels = levels

    @property
    def s(self):
        s_ident_diag = np.diag([1] * (self.levels - 1)).tolist()
        return sp.Matrix(
            [
                [1] * (self.levels - 1),
            ]
            + s_ident_diag
        )

    @property
    def w_diag_elements(self):
        return tuple(sp.Symbol(f"W_{i}") for i in range(1, self.levels + 1))

    @property
    def w(self):
        return sp.Matrix(np.diag(self.w_diag_elements).tolist())

    @property
    def p_left(self):
        return sp.Inverse(
            sp.MatMul(sp.Transpose(self.s), sp.Inverse(self.w), self.s)
        )

    @property
    def p_right(self):
        return sp.MatMul(sp.Transpose(self.s), sp.Inverse(self.w))

    @property
    def p(self):
        return sp.MatMul(self.p_left, self.p_right)

    @property
    def s_p(self):
        return sp.MatMul(self.s, self.p)

    @property
    def s_p_numerical(self):
        return sp.lambdify(self.w_diag_elements, self.s_p)

    def visualize_s_p(self, w_elements, ax):
        sns.heatmap(
            self.s_p_numerical(*w_elements), annot=True, cbar=False, ax=ax
        )
        ax.grid(False)
        ax.set(xticklabels=[], yticklabels=[])
        ax.tick_params(bottom=False, left=False)
        ax.set_title(f"$W_{{diag}} = {w_elements}$")
        return ax
```
```python
mtm_3 = MinTMatrices(levels=3)
print(
    f"s: {sp.latex(mtm_3.s)}\n"
    f"p: {sp.latex(mtm_3.p.as_explicit())}\n"
    f"s_p: {sp.latex(mtm_3.s_p.as_explicit())}\n"
)

# 2 bottom series, in total three series
mtm_3.s
mtm_3.p
mtm_3.s_p.as_explicit()

w_elements = [
    (1, 1, 1),
    (2, 1, 1),
]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(4 * 2, 4))
for idx, w in enumerate(w_elements):
    mtm_3.visualize_s_p(w, axes[idx])
fig.show()
```
Implementations
There are different methods to get the covariance matrix \(\mathbf W\). We discuss a few examples and their implications.
| method | \(\mathbf W\) | Note |
|---|---|---|
| OLS | \(\mathbf I\) | More weight on the higher levels in the hierarchy |
| Structural Scaling | \(\operatorname{diag}(\mathbf S \mathbf 1)\) | Less weight on higher levels compared to OLS |
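As a sketch, we can compute the resulting projection \(\mathbf S \mathbf P\) numerically for both choices of \(\mathbf W\) and verify that each maps any base forecast to a coherent one (hierarchy and helper names are ours):

```python
import numpy as np

# Summing matrix for [Total, CA, TX, WI] over the bottom level [CA, TX, WI]
S = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def mint_projection(S, W):
    """S P with the MinT mapping P = (S' W^-1 S)^-1 S' W^-1."""
    W_inv = np.linalg.inv(W)
    P = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv
    return S @ P


# OLS: W = I
sp_ols = mint_projection(S, np.eye(4))
# Structural scaling: W = diag(S 1), the number of bottom series under each node
sp_wls = mint_projection(S, np.diag(S @ np.ones(3)))
```

Both projections satisfy \(\mathbf S \mathbf P \mathbf S = \mathbf S\) and are idempotent, i.e., reconciling already-coherent forecasts changes nothing.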
Real-world Data¶
Code
The code for this subsection can be found in this notebook (also available on Google Colab).
We use a small subset of the M5 competition data to show that MinT works by shifting the values on different hierarchies.
| date | CA | TX | WI | CA_1 | CA_2 | CA_3 | CA_4 | TX_1 | TX_2 | TX_3 | WI_1 | WI_2 | WI_3 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-29 00:00:00 | 14195 | 9438 | 8998 | 4337 | 3494 | 4739 | 1625 | 2556 | 3852 | 3030 | 2704 | 2256 | 4038 | 32631 |
| 2011-01-30 00:00:00 | 13805 | 9630 | 8314 | 4155 | 3046 | 4827 | 1777 | 2687 | 3937 | 3006 | 2194 | 1922 | 4198 | 31749 |
| 2011-01-31 00:00:00 | 10108 | 6778 | 6897 | 2816 | 2121 | 3785 | 1386 | 1822 | 2731 | 2225 | 1562 | 2018 | 3317 | 23783 |
| 2011-02-01 00:00:00 | 11047 | 7381 | 6984 | 3051 | 2324 | 4232 | 1440 | 2258 | 2954 | 2169 | 1251 | 2522 | 3211 | 25412 |
| 2011-02-02 00:00:00 | 9925 | 5912 | 3309 | 2630 | 1942 | 3817 | 1536 | 1694 | 2492 | 1726 | 2 | 1175 | 2132 | 19146 |
We apply a simple LightGBM model using Darts. The forecasts are not coherent.

Applying the MinT method, we reach coherent forecasts for all levels. The following chart shows the example for the top two levels.

Each step is adjusted differently since the forecasted values are different. To see exactly how the forecasts are adjusted to reach coherency, we plot the difference between the reconciled results and the original forecasts, \(\tilde{\mathbf y} - \hat{\mathbf y}\).

Tools and Packages¶
Darts and hierarchicalforecast from Nixtla provide good support for reconciliations.
-
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2022-11-27. ↩↩
-
Rangapuram SS, Werner LD, Benidis K, Mercado P, Gasthaus J, Januschowski T. End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. ↩↩↩
-
Wickramasuriya SL, Athanasopoulos G, Hyndman RJ. Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association 2019; 114: 804–819. ↩↩
-
Erven T van, Cugliari J. Game-Theoretically optimal reconciliation of contemporaneous hierarchical time series forecasts. In: Modeling and stochastic learning for forecasting in high dimensions. Springer International Publishing, 2015, pp 297–317. ↩
Ended: Hierarchical Time Series
Useful Datasets ↵
Time Series Datasets¶
We list a few useful real-world time series datasets here.
| name | link | descriptions |
|---|---|---|
| ECB Exchange Rate | Website | ECB Exchange Rate Details |
| NREL Solar Power | Website | |
| Electricity | UCI ElectricityLoadDiagrams20112014 Data Set | |
| PEMS | Caltrans PeMS | |
Apart from real-world data, we also use synthetic data to demonstrate time series analysis and forecasting. The following are some synthetic time series datasets.
| name | link | descriptions |
|---|---|---|
| Chaotic Systems | williamgilpin/dysts | williamgilpin/dysts |
Time Series Dataset: ECB Exchange Rate¶
We download the time series data in zip format using this link.
We find 41 currencies in this dataset. The earliest date is 1999-01-04.


Time Series Dataset: Solar Energy¶
We download the time series data from this link.
NREL's Solar Power Data for Integration Studies are synthetic solar photovoltaic (PV) power plant data points for the United States representing the year 2006.
We downloaded the data for Alabama on 2022-11-05 and loaded Actual_30.45_-88.25_2006_UPV_70MW_5_Min.csv as an example. We found a lot of 0 entries, which is expected as there is no solar power production during dark nights.
| Power is Zero | Number of Records |
|---|---|
| False | 57868 |
| True | 47252 |
The dataset contains multiple files with each file containing a time series with a time step of 5 minutes (naming convention explained here).



Time Series Dataset: Electricity¶
This dataset is provided as the "ElectricityLoadDiagrams20112014 Data Set" on the UCI website. It is the time series of electricity consumption of 370 points/clients.
We download the time series data in zip format using this link.
We find that
- in total 140256 rows and 370 series,
- the earliest time is 2011-01-01 00:15:00,
- the latest time is 2015-01-01 00:00:00,
- a fixed time interval of 15 minutes.
We plot only three of the series, sampling every 100 time steps.

We find no missing values.

Loading and Basic Cleaning¶
We provide some code to load the data from the UCI website.
```python
import io
import zipfile

import pandas as pd
import requests

# Download from remote URL
data_uri = "https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip"
r = requests.get(data_uri)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("data/uci_electricity/")

# Load as pandas dataframe
df = (
    pd.read_csv("data/uci_electricity/LD2011_2014.txt", delimiter=";", decimal=",")
    .rename(columns={"Unnamed: 0": "date"})
    .set_index("date")
)
df.index = pd.to_datetime(df.index)
```
Time Series Dataset: PEMS¶
California Department of Transportation (Caltrans) Performance Measurement System (PeMS) provides traffic data on their website. To download the data1,
- Register on the website and wait for approval, then login.
- Go to this page and choose the data we need using the filter on the top.
  - For example, we choose Type = `Station 5-Minute` and District = `District 3`.
We do not show examples of this dataset here.
-
VeritasYin. How to download the dataset from PeMS website? · Issue #6 · VeritasYin/STGCN_IJCAI-18. In: GitHub [Internet]. [cited 5 Nov 2022]. Available: https://github.com/VeritasYin/STGCN_IJCAI-18/issues/6 ↩
Ended: Useful Datasets
Ended: Fundamentals of Time Series Forecasting
Trees ↵
Tree-Based Models¶
Trees are still powerful machine-learning models for time series forecasting. We explain the basic ideas of trees in the following sections.
Should I Work from Home?¶
We prepared a notebook for this section here .
To illustrate the idea of trees, we use a simple classification task: Deciding whether a person will go to the office or work from home based on an artificial dataset.
Definition of the problem¶
We will decide whether one should go to work today. In this demo project, we consider the following features.
| feature | possible values |
|---|---|
| health | 0: feeling bad, 1: feeling good |
| weather | 0: bad weather, 1: good weather |
| holiday | 1: holiday, 0: not holiday |
Our prediction will be a binary result, 0 or 1, with 0 indicating staying at home and 1 indicating going to work.
Notations
For compactness, we use the notation \(\{0,1\}^3\) to describe a set of three features, each with 0 and 1 as possible values. In general, the notation \(\{0,1\}^d\) indicates \(d\) binary features.
Meanwhile, the prediction can be denoted as \(\{0,1\}^1\).
How to Describe a Decision Tree¶
In theory, we would expect a decision tree of the following.
graph TD
A[health] --> |feeling bad| E[stay home]
A[health] --> |feeling good| B[weather]
B --> |bad weather| E
B --> |good weather| C[holiday]
C --> |holiday| E
C --> |not holiday| G[go to the office]
It is straightforward to prove that the max required depth and the max required number of leaves of a model that maps \(\{0,1\}^d\) to \(\{0,1\}^1\) are \(d+1\) and \(2^d\), respectively. In our simple example, some of the branches are truncated based on our understanding of the problem. In principle, the branch "feeling bad" could also go on to the next level.
Data¶
However, we are not always lucky enough to be able to forge trees using experience and common sense. It is more common to build the tree using data.
Artificial Dataset
To fit a model, we generated some artificial data using this notebook.
When generating the data, we follow the rule that one goes to the office if and only if
- the person is healthy,
- the weather is good, and
- today is not a holiday.
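The rule can be written down directly as a sketch (the function name is ours):

```python
def go_to_office(health: int, weather: int, holiday: int) -> int:
    """1 iff healthy, good weather, and not a holiday."""
    return int(health == 1 and weather == 1 and holiday == 0)


# Enumerate the full truth table of the three binary features:
# exactly one of the 8 combinations leads to going to the office
truth_table = {
    (h, w, hol): go_to_office(h, w, hol)
    for h in (0, 1)
    for w in (0, 1)
    for hol in (0, 1)
}
```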
The following table shows a small sample of the dataset.
| | health | weather | holiday | go_to_office |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
Build a Tree¶
We use sklearn to build a decision tree, see code here. We observed that the decision tree we get from the data is exactly what we expected.
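The tree fitting itself can be sketched with scikit-learn (the data generation and hyperparameters here are our assumptions, not the linked notebook's exact code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=42)

# Sample the three binary features uniformly
X = rng.integers(0, 2, size=(100, 3))  # columns: health, weather, holiday
# Deterministic rule: go to the office iff healthy, good weather, not a holiday
y = X[:, 0] & X[:, 1] & (1 - X[:, 2])

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)
```

Since the labels follow the rule exactly, the fitted tree reproduces the rule perfectly on the training data.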

Reading the Decision Tree Chart
On each node of the tree, we read useful information.
In the root, aka the first node at the top, the feature name and threshold are shown on the first row, i.e., weather <= 0.5, which means we are making decisions based on whether the value of the weather feature is less than or equal to 0.5. If it is, we go to the left branch; otherwise, we go to the right branch. The following rows in the node assume the condition is satisfied.
On the second row, we read the Gini impurity value. Gini impurity is a measure of the impurity of the data under the condition.
On the third row, the number of samples of the given condition (weather <= 0.5) is also given.
Finally, we read the values of the samples. In this example, value = [93, 7], i.e., 93 of the samples have a target value 0, and 7 of the samples have a target value 1.
This is a perfect result, as it matches our theoretical expectations. This is because we built our dataset using the rules, so naturally we get a perfect tree.
In reality, our dataset is probabilistic or comes with noise. To see how the noise affects our decision tree, we can build a tree using a perturbed dataset. Here is an example.

A decision tree trained on a noisy, "impure" dataset does not always fit our theoretical model. For example, on the leaves, aka the bottom level, we see some leaves containing both going to the office and staying at home, which corresponds to a nonzero Gini impurity. Though we take the majority target value when making predictions, we can already imagine that some of the data points will be misclassified.
How was the Model Built?¶
Many different algorithms can build a decision tree from a given dataset. The Iterative Dichotomizer 3 algorithm, aka ID3 algorithm, is one of the famous implementations of the decision tree1. The following is the "flowchart" of the algorithm1.
graph TD
Leaf("Prepare samples in node")
MajorityVote["Calculate majority vote"]
Assign[Assign label to node]
Leaf --> MajorityVote --> Assign
Assign --> Split1[Split on feature 1]
Assign --> Splitdots["..."]
Assign --> Splitd[Split on feature d]
subgraph "split on a subset of features"
Split1 --> |"Split on feature 1"|B1["Calculate gain of split"]
Splitdots --> |"..."| Bdots["..."]
Splitd --> |"Split on feature d"| Bd["Calculate gain of split"]
end
B1 --> C["Use the split with the largest gain"]
Bdots --> C
Bd --> C
C --> Left["Prepare samples in left node"]
C --> Right["Prepare samples in right node"]
subgraph "left node"
MajorityVoteL["Calculate majority vote"]
AssignL(Assign label to left node)
Left --> MajorityVoteL --> AssignL
end
subgraph "right node"
MajorityVoteR["Calculate majority vote"]
Right --> MajorityVoteR
AssignR(Assign label to right node)
MajorityVoteR --> AssignR
end
To "calculate the gain of the split", here we use Gini impurity. There are other "gains" such as information gain. For regression tasks, we can also have gains such as a MSE loss.
Overfitting¶
Fully grown trees are likely to overfit the data since they always try to grow pure leaves. Besides, fully grown trees grow exponentially with the number of features, which requires a lot of computational resources.
Applying Occam's razor, we prefer smaller trees as long as the trees can explain the data well.
To achieve this, we either limit how the trees grow during training or prune the trees after they are built. Pruning replaces the subtree at a node with a leaf if certain conditions based on cost estimations are met.
-
Shalev-Shwartz S, Ben-David S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014 doi:10.1017/CBO9781107298019. ↩↩
Random Forest¶
From Ho TK. Random decision forests
"The essence of the method is to build multiple trees in randomly selected subspaces of the feature space."1
Random forest is an ensemble method based on decision trees, which are dubbed base learners. Instead of using one single decision tree that models all the features, we utilize a bunch of decision trees, each of which may model a subset of the features (a feature subspace). To make predictions, the results from the trees are combined by some form of voting or averaging.
Translating to math language, given a proper dataset \(\mathscr D(\mathbf X, \mathbf y)\), random forest or the ensemble of trees, denoted as \(\{f_i\}\), will predict an ensemble of results \(\{f_i(\mathbf X_i)\}\), with \(\mathbf X_i \subseteq \mathbf X\).
A Good Reference for Random Forest
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media; 2013. pp. 567–567.
However, random forest is not "just" ensembling. Many ensembling methods, e.g., plain bootstrapping (bagging), suffer from correlations between the trees. Random forest has two levels of randomization:
- Bootstrapping the dataset by randomly selecting a subset of the training data;
- Random selection of the features to train a tree.
We could already use the bootstrapping step alone to create many models to ensemble; however, the randomization of features is also key to a random forest model, as it helps reduce the correlations between the trees2. In this section, we ask ourselves the following questions.
- How to democratize the ensemble of results from each tree?
- What determines the quality of the predictions?
- Why does it even work?
Margin, Strength, and Correlations¶
The margin of the model, the strength of the trees, and the correlation between the trees can help us understand how random forests work.
Margin¶
The margin of the tree is defined as34
Terms in the Margin Definition
The first term, \({\color{green}P (\{f_i(\mathbf X)=\mathbf y \})}\) is the probability of predicting the exact value in the dataset. In a random forest model, it can be calculated using
where \(I\) is the indicator function that maps the correct predictions to 1 and the incorrect predictions to 0. The summation is over all the trees.
The term \({\color{red}P (\{f_i(\mathbf X) = \mathbf j \})}\) is the probability of predicting values \(\mathbf j\). The second term \(\operatorname{max}_{\mathbf j\neq \mathbf y} {\color{red}P ( \{f_i(\mathbf X) = \mathbf j\})}\) finds the highest misclassification probabilities, i.e., the max probabilities of predicting values \(\mathbf j\) other than \(\mathbf y\).
Raw Margin
We can also think of the indicator function itself as a measure of how good the predictions are. Instead of looking at the whole forest and the probabilities, the raw margin of a single tree is defined as3
The margin is the expected value of this raw margin over the classifiers.
To make this quantity easier to interpret, we consider the following limiting cases:
- \(M(\mathbf X, \mathbf y) \to 1\): We always predict the true value, for all the trees.
- \(M(\mathbf X, \mathbf y) \to -1\): We always predict a wrong value, for all the trees.
- \(M(\mathbf X, \mathbf y) \to 0\): We have an equal probability of predicting the correct value and a wrong value.
In general, we prefer a model with higher \(M(\mathbf X, \mathbf y)\).
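As a sketch, the margin at a single sample can be computed from the trees' votes (function and variable names are ours):

```python
import numpy as np


def ensemble_margin(votes, y_true, classes):
    """P(ensemble predicts y_true) - max over j != y_true of P(ensemble predicts j)."""
    votes = np.asarray(votes)
    proportions = {c: float(np.mean(votes == c)) for c in classes}
    p_correct = proportions[y_true]
    p_best_wrong = max(p for c, p in proportions.items() if c != y_true)
    return p_correct - p_best_wrong
```

With binary classes, four unanimous correct votes give margin 1, four unanimous wrong votes give margin -1, and a 3-to-1 split gives 0.5.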
Strength¶
However, the margin of the same model differs across problems. A model may give us margin 1 on one problem but might not work that well on a different problem. This can be seen in our decision tree examples.
To bring the idea of margin to a specific problem, Breiman defined the strength \(s\) as the expected value of the margin over the dataset fed into the trees34,
Dataset Fed into the Trees
This may be different in different models since there are different randomization and data selection methods. For example, in bagging, the dataset fed into the trees would be random selections of the training data.
Correlation¶
Naively speaking, for ensembling to provide benefits, we expect each tree to take care of different factors and spit out a different result. To quantify this idea, we define the correlation of the raw margins between trees3
Since the raw margin tells us how likely we are to predict the correct value, the correlation defined above indicates how similarly two trees behave. If all the trees are similar, the correlation is high, and ensembling won't help much in this situation.
To get a scalar value of the whole model, the average correlation \(\bar \rho\) over all the possible pairs is calculated.
Predicting Power¶
The higher the generalization power, the better the model is at new predictions. To measure the goodness of a random forest, the population error can be used,
It has been proven that the error almost surely converges as the number of trees gets large3. The upper bound of the population error is related to the strength and the mean correlation3,
To get a grasp of this upper bound, we plot out the heatmap as a function of \(\bar \rho\) and \(s\).

We observe that
- The stronger the strength, the lower the population error upper bound.
- The smaller the correlation, the lower the population error upper bound.
- If the strength is too low, it is very hard for the model to avoid errors.
- If the correlation is very high, it is still possible to get a decent model if the strength is high.
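These observations can be checked directly against Breiman's bound \(\bar\rho\,(1 - s^2)/s^2\); a minimal sketch:

```python
def pe_upper_bound(rho_bar: float, s: float) -> float:
    """Breiman's upper bound on the population error: rho_bar * (1 - s**2) / s**2."""
    return rho_bar * (1.0 - s ** 2) / s ** 2
```

Increasing the strength lowers the bound, while increasing the mean correlation raises it, matching the heatmap above.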
Random Forest Regressor¶
Similar to decision trees, random forests can also be used as regressors. The population error of a random forest regressor is capped by the average population error of the trees multiplied by the correlation between the trees3.
To see how the regressor works with data, we construct an artificial problem. The code can be accessed here .
A random forest with 1600 estimators can estimate the following sin data. Note that this is in-sample fitting and prediction to demonstrate the capability of representing sin data.
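A sketch reproducing such an in-sample fit with scikit-learn (we use 100 estimators and an assumed sampling grid for speed; the text's experiment uses 1600):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Clean sine data on a uniform grid (our assumed setup)
X = np.linspace(0.0, 4.0 * np.pi, 500).reshape(-1, 1)
y = np.sin(X).ravel()

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
in_sample = model.predict(X)  # in-sample predictions on the training grid

# Per-tree predictions can be dispersed even when the ensemble is accurate
per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
```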

One observation is that not all the trees spit out the same values. We observe some quite dispersed predictions from the trees but the ensemble result is very close to the true values.

We generate a new dataset by adding some noise to the sin dataset. By adding uniform random noise, we introduce some variance but not much bias in the data. We are cheating a bit here because this kind of data is what random forest is good at.
We train a random forest model with 1300 estimators using this noisy data. Note that this is in-sample fitting and prediction to demonstrate the representation capability.

One observation is that not all the trees spit out the same values. The predictions from the trees are sometimes dispersed and not even bell-shaped, yet the ensemble result reflects the values of the true sin data. The ensemble results are even located at the center of the noisy data, where the true sin values should be. However, we will see that the distribution of the predictions is more dispersed than that of the model trained without noise (see the tab "Comparing Two Scenarios").

The following two charts show the boxes for the two trainings.


To see the differences between the box sizes in a more quantitative way, we plot the box plot of the box sizes for each training.

-
Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. 1995, pp 278–282 vol.1. ↩
-
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media, 2013. ↩
-
Breiman L. Random forests. Machine learning 2001; 45: 5–32. ↩↩↩↩↩↩↩
-
Bernard S, Heutte L, Adam S. A study of strength and correlation in random forests. In: Advanced intelligent computing theories and applications. Springer Berlin Heidelberg, 2010, pp 186–191. ↩↩
Gradient Boosted Trees¶
Boosted trees are another ensemble method of trees. Similar to random forest, boosted trees make predictions by combining the predictions from the individual trees. However, instead of averaging, boosted trees are additive models where the prediction \(f(\mathbf X)\) is the sum of the predictions of each tree1,
where \(f_t(\mathbf X)\) is the prediction of tree \(t\) and \(T\) is the total number of trees. Given such a setup, the training becomes very different from random forests. As of 2023, there are two popular implementations of boosted trees, LightGBM and XGBoost. Training a boosted trees model finds a sequence of trees
For a specified loss function \(\mathscr L(\mathbf y, \hat{\mathbf y})\), the sequence of trees helps reduce the loss step by step. At step \(t\), the loss is
To optimize the model, we have to add a tree that reduces the loss the most and approximations are applied for numerical computations2.
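A minimal sketch of this additive training loop, using squared loss and single-split "stump" base learners (all names and sizes are illustrative; this is not the LightGBM or XGBoost implementation):

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit the best single-threshold regression stump to the residuals (1-D x)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residuals[x <= threshold], residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.1):
    """f(x) = f_0 + sum_t lr * f_t(x); each stump fits the current residuals."""
    f0 = y.mean()
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # For squared loss the negative gradient is just the residual
        tree = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * tree(x)
        trees.append(tree)
    return lambda q: f0 + learning_rate * sum(tree(q) for tree in trees)


# Fit a step function: the boosted stumps drive the residuals toward zero
x = np.linspace(0.0, 1.0, 100)
y = (x > 0.5).astype(float)
model = boost(x, y)
```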
The XGBoost documentation and the original XGBoost paper explain the idea nicely with examples.
Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1603.02754
There is more than one realization of gradient boosted trees34.
-
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media, 2013. ↩
-
Chen T, Guestrin C. XGBoost: A scalable tree boosting system. 2016. http://arxiv.org/abs/1603.02754. ↩
-
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W et al. LightGBM: A highly efficient gradient boosting decision tree. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S et al. (eds). Advances in neural information processing systems. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf. ↩
-
Shi Y, Li J, Li Z. Gradient boosting with piece-wise linear regression trees. 2018. http://arxiv.org/abs/1802.05640. ↩
Forecasting with Trees Using Darts¶
Darts provides wrappers for tree-based models. In this section, we benchmark random forest and gradient-boosted decision tree (GBDT) models on the famous air passenger dataset. Through the benchmarks, we will see the key advantages and disadvantages of tree-based models in forecasting.
Just Run It
The notebooks created to produce the results in this section can be found here for random forest and here for GBDT.
The Simple Random Forest¶
We will build different models to demonstrate the strengths and weaknesses of random forest models. The focus will be on in-sample and out-of-sample predictions. We know that trees are not good at extrapolating into realms where the out-of-sample distribution differs from the training data, because of the constant values assigned to each leaf. Real-world time series are often non-stationary and heteroscedastic, which implies that the distribution during the test phase may differ from that of the training data.
Data¶
We choose the famous air passenger data. The dataset shows the number of air passengers in each month.

Baseline Model: "Simply Wrap the Model on the Data"¶
A naive idea is to simply wrap a tree-based model around the data. Here we choose a random forest model from scikit-learn.
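As a sketch of what "wrapping a model on the data" means, the following hand-rolls lag features on a toy series with the same trend-plus-seasonality shape as the air passenger data (the notebooks use Darts; the series, lag count, and split here are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# A toy stand-in for the air passenger series: upward trend + yearly seasonality.
t = np.arange(144, dtype=float)
series = 100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12)

def make_lag_features(values, n_lags):
    """Turn a series into (lag-window, next-value) supervised pairs."""
    x = np.array([values[i - n_lags:i] for i in range(n_lags, len(values))])
    y = values[n_lags:]
    return x, y

x, y = make_lag_features(series, n_lags=12)
x_train, y_train, x_test, y_test = x[:-24], y[:-24], x[-24:], y[-24:]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
pred = model.predict(x_test)

# Each tree averages training targets, so no prediction can exceed the
# largest target seen in training -- the test-period trend is missed.
print(pred.max() <= y_train.max(), y_test.max() > y_train.max())  # True True
```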

The predictions are quite off. However, if we look into the in-sample predictions, i.e., time range that the model has already seen during training, we would not have observed such bad predictions.

This indicates that there are new patterns in the data to be forecasted. That is, the distribution of the data has changed: the level of the values is higher than before, and so is the variance. This is a typical case where trees fall short. However, trees can handle such cases if we preprocess the data to bring it to the same level and reduce the changes in variance.
Detrend but also a Clairvoyant Model¶
To confirm that this is due to the mismatch of the in-sample distribution and the out-of-sample distribution, we plot out the histograms of the training series and the test series.

This hints that we should at least detrend the data. In this example, we detrend using a simple moving average while assuming multiplicative components. The detrended data is shown below. Without even training a model, we immediately see that forecasting such simple patterns is easier than forecasting the patterns in the raw data.

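A minimal sketch of the multiplicative detrending step, on a toy series (the actual notebooks use Darts utilities; the window length and series here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy multiplicative series: trend * seasonal factor.
t = np.arange(72)
series = pd.Series((100 + 2 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)))

# A centered 12-step moving average estimates the trend; under the
# multiplicative assumption, dividing by it leaves the seasonal cycle.
trend = series.rolling(window=12, center=True).mean()
detrended = series / trend

print(detrended.std(), series.std())
```

The detrended series oscillates around one with a much smaller spread than the raw series.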
To illustrate that detrending helps, we will cheat a bit to detrend the whole series to confirm that the forecasts are better.

Distribution of Detrended Data

A Formal Model to Use Detrending and without Information Leak¶
The above method leads to a great result, however, with information leakage during detrending. Nevertheless, it indicates how trees perform on out-of-sample predictions if we only predict the cyclic part of the series. In a real-world case, however, we have to predict the trend accurately for this to work. To better reconstruct the trend, we first use a Box-Cox transformation to stabilize the variance. The following plot shows the transformed data.

With the transformed data, we build a simple linear trend using the training dataset and extrapolate the trend to the dates of the prediction.

Finally, we fit a random forest model on the detrended data, i.e., the Box-Cox transformed data minus the linear trend, then reconstruct the predictions, i.e., predictions plus linear trend followed by the inverse Box-Cox transformation. We observe a much better performance than with the first random forest we built.
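The transformation pipeline can be sketched as follows (a toy series, not the notebook's exact code; the forest fit on the detrended residuals is replaced by a zero placeholder forecast to keep the sketch short):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Toy series with a growing level and growing variance.
t = np.arange(120, dtype=float)
series = (50.0 + t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12))

train_t, test_t = t[:96], t[96:]
train = series[:96]

# 1. Box-Cox transform fitted on the training data only (no information leak).
transformed, lmbda = boxcox(train)

# 2. Linear trend fitted on the transformed training data, extrapolated ahead.
slope, intercept = np.polyfit(train_t, transformed, deg=1)
trend_test = slope * test_t + intercept

# 3. A forest would be fit on the detrended series `transformed - trend`;
#    here a zero forecast of the cycle stands in for its output. Forecasts are
#    reconstructed by adding the trend back and inverting the Box-Cox transform.
cycle_forecast = np.zeros_like(test_t)
reconstructed = inv_boxcox(cycle_forecast + trend_test, lmbda)
print(reconstructed[:3])
```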

Comparisons of the Three Random Forest Models¶
Visual inspection shows that cheating leads to the best result, followed by the simple linear detrend model.

To formally benchmark the results, we computed several metrics. For most of the metrics, the detrend (cheating) model is the best, which is expected since it is a clairvoyant model that peeks into the future. The second best is the Box-Cox + linear trend model.

Gradient Boosted Trees¶
Similar behavior is also observed for gradient-boosted decision trees (GBDT). We perform exactly the same steps as for the random forest model. The results are shown below. For GBDT, we also tested the linear tree model2.

Why Linear Tree
For trees, the predictions are flat within a bucket of values. For example, we may get the same prediction for the feature values 10 and 10.1 since the two values are so close to each other. However, without detrending, we expect the predictions to follow the upward trend in our data.
To account for this, LightGBM has a parameter called linear_tree2. This parameter allows the model to fit a linear model within the leaf nodes to capture the trend. This means we do not need to detrend the data before fitting the model. In this specific example, we see that the linear tree model performs quite well.
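The flat-leaf behavior can be seen directly with a plain scikit-learn regression tree (a sketch, unrelated to the LightGBM notebooks):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on an upward-trending target over x in [0, 99].
x = np.arange(100, dtype=float).reshape(-1, 1)
y = x.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(x, y)

# Inside the training range the fit is exact...
print(tree.predict([[50.0]]))   # 50.0
# ...but beyond it, the prediction is stuck at the last leaf's constant value.
print(tree.predict([[150.0]]))  # 99.0
```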
The benchmark metrics for the GBDT models are shown below. As expected, the linear tree and Box-Cox + linear trend models beat the baseline model.

Trees are Powerful¶
Up to this point, we may have the feeling that trees are not the best choice for forecasting. As a matter of fact, trees are widely used in competitions and have achieved a lot in forecasting3. Apart from being simple and robust, trees can also be made probabilistic. Trees are attractive as a first model to try because they usually work quite well out of the box1.
-
"Out of the box" sounds like something easy to do. However, if one ever reads the list of parameters of LightGBM, the thought of "easy" will immediately diminish. ↩
-
LightGBM. Parameters — LightGBM 4.3.0.99 documentation. In: LightGBM [Internet]. [cited 2 Feb 2024]. Available: https://lightgbm.readthedocs.io/en/latest/Parameters.html#linear_tree ↩↩
-
Januschowski T, Wang Y, Torkkola K, Erkkilä T, Hasson H, Gasthaus J. Forecasting with trees. International journal of forecasting 2022; 38: 1473–1481. ↩
Ended: Trees
Fundamentals of Deep Learning ↵
Deep Learning Fundamentals¶
Deep learning, as the rising method for time series forecasting, requires the knowledge of some fundamental principles.
In this part, we explain and demonstrate some popular deep learning models. Note that we do not intend to cover all models but only discuss a few popular principles.
The simplest deep learning model is a fully connected feedforward neural network (FFNN). While an FFNN might work for in-distribution predictions, it is likely to overfit and perform poorly on out-of-distribution predictions. In reality, most deep learning models are much more complicated than an FFNN, and a large share of them utilize the self-supervised learning concept to provide better generalization1.
In the following chapters, we provide some popular deep learning architectures and cool ideas.
Notations
In this document, we use the following notations.
- Sets, domains, abstract variables, \(X\), \(Y\);
- Probability distribution \(P\), \(Q\);
- Probability density \(p\), \(q\);
- Slicing arrays from index \(i\) to index \(j\) using \({}_{i:j}\).
-
Liu X, Zhang F, Hou Z, Wang Z, Mian L, Zhang J et al. Self-supervised learning: Generative or contrastive. 2020. http://arxiv.org/abs/2006.08218. ↩
Learning from Data¶
Learning from data is a practice of extracting compressed knowledge about the world from data. There are many frameworks of learning. For example, the induction, deduction, and transduction schema shows different possible paths to produce predictions.
graph LR
P[Prediction]
D[Data]
M[Model]
D --"Induction"--> M
M --"Deduction"--> P
D --"Transduction"--> P
There are two different approaches to making predictions based on some given data.
- Find a good model from the data; this is called induction. Once we have a model, use it to make predictions; this is called deduction.
- Directly make predictions from the data; this is called transduction.
The Nature of Statistical Learning Theory
Vapnik's seminal book The Nature of Statistical Learning Theory is a very good read for the fundamentals of learning theories1.
Vapnik also discussed some of the key ideas in a book chapter Estimation of Dependences Based on Empirical Data2.
In the context of machine learning, Abu-Mostafa, Magdon-Ismail, and Lin summarized the machine learning problem using the following chart 3. Ultimately, we need to find an approximation \(g\) of the true map \(f\) from features \(\mathcal X\) to targets \(\mathcal Y\) on a specific probability distribution of features \(P\). This process is done by using an algorithm to select a hypothesis that works.
flowchart LR
X[Data Samples]
A[Algorithm]
H[Hypotheses Set]
SH[Selected Hypothesis]
X --> A
H --> A
A --> SH
Based on this framework, a machine learning process usually consists of three core components4.
- Representation: Encoded data and the problem representation.
- Evaluation: An objective function to be evaluated that guides the model.
- Optimization: An algorithm to optimize the model so it learns what we want it to do.
We will reuse this framework again and again in the following sections of this chapter.
-
Vapnik V. The nature of statistical learning theory. Springer: New York, NY, 2010. doi:10.1007/978-1-4757-3264-1. ↩
-
Vapnik V. Estimation of dependences based on empirical data. 1st ed. Springer: New York, NY, 2006. doi:10.1007/0-387-34239-7. ↩
-
Abu-Mostafa YS, Magdon-Ismail M, Lin H-T. Learning from data: A short course. AMLBook, 2012. https://www.semanticscholar.org/paper/Learning-From-Data-Abu-Mostafa-Magdon-Ismail/1c0ed9ed3201ef381cc392fc3ca91cae6ecfc698. ↩
-
Domingos P. A few useful things to know about machine learning. Communications of the ACM 2012; 55: 78–87. ↩
Neural Networks¶
Neural networks have been a buzzword for machine learning in recent years. As indicated in the name, artificial neural networks are artificial neurons connected in a network. In this section, we discuss some intuitions and theories of neural networks.
Artificial vs Biological
Neuroscientists also discuss neural networks, or neuronal networks in their research. Those are different concepts from the artificial neural networks we are going to discuss here. In this book, we use the term neural networks to refer to artificial neural networks, unless otherwise specified.
Intuitions¶
We start with some intuitions of neural networks before discussing the theoretical implications.
What is an Artificial Neuron¶
What an artificial neuron does is respond to stimulations. This response could be strong or weak depending on the strength of the stimulations. Here is an example.

Using one single neuron, we do not have much to build. It is just the function we observed above. However, by connecting multiple neurons into a network, we can compose complicated functions and generalize the scalar function to multi-dimensional functions.
Before we connect this neuron to a network, we study a few transformations first. The response function can be shifted, scaled, or inverted. The following figure shows the effect of these transformations.

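These transformations are easy to write down numerically. A small sketch with the sigmoid as the response function (the specific shifts and scales are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(x):
    """A smooth response: weak for low stimulation, strong for high."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 5)

base = sigmoid(x)          # the raw response
shifted = sigmoid(x - 3)   # shifted: the neuron activates at a higher stimulation
scaled = sigmoid(5 * x)    # scaled: a steeper, almost step-like response
inverted = 1 - sigmoid(x)  # inverted: strong response to weak stimulation
```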
Artificial Neural Network¶
A simple network is a collection of neurons that respond to stimulations, which could come from the responses of other neurons.

A given input signal is spread onto three different neurons. The neurons respond to this signal separately before their responses are combined with different weights. In the language of math, given input \(x\), the output \(y(x)\) is

$$
y(x) = \sum_{k=1}^{3} v_k \,\mathrm{activation}(w_k x + u_k),
$$

where \(\mathrm{activation}\) is the activation function, i.e., the response behavior of the neuron, \(w_k\) and \(u_k\) parametrize the stimulation of neuron \(k\), and \(v_k\) is its combination weight. This is a single-layer structure.
\(\mathrm{activation} \to \sigma\)
In the following discussions, we will use \(\sigma\) as a drop in replacement for \(\mathrm{activation}\).
To extend this naive and shallow network, we could
- increase the number of neurons on one layer, i.e., go wider, or
- extend the number of layers, i.e., go deeper, or
- add interactions between neurons, or
- include recurrent connections in the network.

Composition Effects¶
To build up intuitions of how multiple neurons work together, we take an example of a network with two neurons. We will solve two problems:
- Find out if a hotel room is hot or cold.
- Find out if the hotel room is comfortable to stay.
The first task can be solved using a single neuron. Suppose the input to the neuron is the temperature of the room. The output of the neuron is a binary value: 1 for hot and 0 for cold. The following figure shows the response of the neuron.

In the figure above, we use red for "hot" and blue for "cold". In this example, the temperature being \(T_1\) means the room is cold, while that being \(T_2\) and \(T_3\) indicate hot rooms.
However, moving on to the second problem, such monotonic functions won't work. It is only comfortable to stay in the hotel room if the temperature is neither too high nor too low. Now consider two neurons in a network. One neuron has a monotonically increasing response to the temperature, while the other has a monotonically decreasing response. The following figure shows the combined response of the two neurons. We observe that the combined response is high only when the temperature is in a certain range.

Suppose we have three rooms with temperatures \(T_1\), \(T_2\), \(T_3\) respectively. Only \(T_2\) falls into the region of large output value which corresponds to the habitable temperature.

Mathematical Formulation
The above example can be formulated as $$ f(x) = \sum_k v_k \sigma(w_k x + u_k) $$ where \(\sigma\) is some sort of monotonic activation function.
It is a form of single hidden layer feedforward network. We will discuss this in more detail in the following sections.
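The formulation above can be evaluated directly. A sketch with two sigmoid neurons (the temperatures, weights, and band edges are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def comfort(temperature, low=18.0, high=26.0, steepness=2.0):
    """Combine an increasing and a decreasing neuron into a 'band' response."""
    return sigmoid(steepness * (temperature - low)) - sigmoid(steepness * (temperature - high))

# Only the middle room falls inside the comfortable band.
t1, t2, t3 = 5.0, 22.0, 40.0
print(comfort(t2) > comfort(t1), comfort(t2) > comfort(t3))  # True True
```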
These two examples hint that neural networks are good at classification tasks. Neural networks excel at a variety of tasks. Since this book is about time series, we will demonstrate the power of neural networks in time series analysis.
Universal Approximators¶
Even a single hidden layer feedforward network can approximate any measurable function, given a suitable squashing activation function2. In the case of the commonly used sigmoid activation function \(\sigma\), a neural network for real inputs becomes

$$
g(x) = \sum_{k=1}^{N} v_k \, \sigma(w_k x + u_k).
$$

It is a good approximator of continuous functions3.
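A small numerical illustration of such a superposition (not from the text; the random features and the target function are assumptions): fix random \(w_k\), \(u_k\), and solve only for the output weights \(v_k\) by least squares.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 200)
target = np.sin(2 * x)

# Random hidden layer: sigma(w_k * x + u_k) for 100 fixed random neurons.
n_hidden = 100
w = rng.normal(scale=3.0, size=n_hidden)
u = rng.normal(scale=3.0, size=n_hidden)
features = sigmoid(np.outer(x, w) + u)  # shape (200, 100)

# Fit only the output weights v_k by least squares.
v, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ v
print(np.max(np.abs(approx - target)))  # small approximation error
```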
Kolmogorov's Theorem
Kolmogorov's theorem shows that a continuous multivariable function on a compact set can be built from a finite number of carefully chosen continuous single-variable functions, combined by weighted sums and compositions4.
Neural Networks Can be Complicated¶
In practice, we observe a lot of problems when the number of neurons grows, e.g., the convergence during the training slows down if we have too many layers in the network (the vanishing gradient problem) 5.
Training
We have not yet discussed how to adjust the parameters in a neural network. The process is called training. The most popular method is backpropagation1.
The reader should understand that a good neural network model is not only about these naive examples but is about many different topics. For example, to solve the vanishing gradient problem, new architectures are proposed, e.g., residual blocks6, new optimization techniques were proposed7, and theories such as information highway also became the key to the success of deep neural networks8.
-
Nielsen MA. How the backpropagation algorithm works. In: Neural networks and deep learning [Internet]. [cited 22 Nov 2023]. Available: http://neuralnetworksanddeeplearning.com/chap2.html ↩
-
Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural networks: the official journal of the International Neural Network Society 1989; 2: 359–366. ↩
-
Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 1989; 2: 303–314. ↩
-
Hassoun M. Fundamentals of artificial neural networks. The MIT Press, Massachusetts Institute of Technology, 2021. https://mitpress.mit.edu/9780262514675/fundamentals-of-artificial-neural-networks/. ↩
-
Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In: Kremer SC, Kolen JF (eds). A field guide to dynamical recurrent neural networks. IEEE Press, 2001. ↩
-
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2015. http://arxiv.org/abs/1512.03385. ↩
-
Huang G, Sun Y, Liu Z, Sedra D, Weinberger K. Deep networks with stochastic depth. 2016. http://arxiv.org/abs/1603.09382. ↩
-
Srivastava RK, Greff K, Schmidhuber J. Highway networks. 2015. http://arxiv.org/abs/1505.00387. ↩
Recurrent Neural Networks¶
In the section Neural Networks, we discussed the feedforward neural network.
Biological Neural Networks
Biological neural networks contain recurrent units. There are theories that employ recurrent networks to explain our memory3.
Recurrent Neural Network Architecture¶
A recurrent neural network (RNN) can be achieved by including loops in the network, i.e., the output of a unit is fed back to itself. As an example, we show a single unit in the following figure.

On the left, we have the unfolded (unrolled) RNN, while the representation on the right is the compressed form. A simplified mathematical representation of the RNN is

$$
h(t) = f\left( W_h h(t-1) + W_x x(t) + b \right), \label{eq-rnn-vanilla}
$$

where \(h(t)\) represents the state of the unit at time \(t\), \(x(t)\) is the input at time \(t\), and \(f\) is the activation function.
RNN and First-order Differential Equation
There are different views of the nature of time series data. Many of the time series datasets are generated by physical systems that follow the laws of physics. Mathematicians and physicists already studied and built up the theories of such systems and the framework we are looking into is dynamical systems.
The vanilla RNN described in \(\eqref{eq-rnn-vanilla}\) is quite similar to a first-order differential equation. For simplicity, we use ReLU for \(f(\cdot)\).
Note that the Taylor expansion of \(h(t-1)\) around \(t\) is

$$
h(t-1) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!} h^{(n)}(t),
$$

where \(h^{(n)}(t)\) is the \(n\)th derivative of \(h(t)\). Assuming that the series converges and higher orders don't contribute much, we rewrite \(\eqref{eq-rnn-vanilla}\) as
where
We have reduced the RNN formula to a first-order differential equation. Without discussing the details, we know that an exponential component \(e^{W_h t}\) will rise in the solution. The component may explode or shrink.
Based on our intuition of differential equations, such a dynamical system usually either blows up or dies out after a large number of iterations. This can also be shown explicitly by writing down the backpropagation formula, in which states in the far past contribute little to the gradient. This is the famous vanishing gradient problem of RNNs4. One solution is to introduce memory into the iterations, e.g., long short-term memory (LSTM)5.
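The gradient behavior can be sketched numerically: in a linear RNN, the gradient with respect to a state \(T\) steps in the past contains a product of \(T\) Jacobians, each essentially \(W_h\). A toy illustration (random matrices, with scales chosen to sit below and above a spectral radius of one):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_through_time(w_scale, steps=50, dim=8):
    """Norm of the product of `steps` copies of W_h^T, as appears in
    backpropagation through time for a linear RNN."""
    w_h = w_scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    grad = np.eye(dim)
    for _ in range(steps):
        grad = grad @ w_h.T
    return np.linalg.norm(grad)

vanishing = gradient_norm_through_time(0.5)  # spectral radius below one
exploding = gradient_norm_through_time(2.0)  # spectral radius above one
print(vanishing, exploding)
```

The first norm collapses toward zero (vanishing gradient) while the second blows up (exploding gradient).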
In the basic example of RNN shown above, the output of the hidden state is fed to itself in the next iteration. In theory, the value to feed back to the unit and how the input and output are calculated can be quite different in different setups 12. In the section Forecasting with RNN, we will show some examples of the different setups.
-
Amidi A, Amidi S. CS 230. In: Recurrent Neural Networks Cheatsheet [Internet]. [cited 22 Nov 2023]. Available: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks ↩
-
Karpathy A. The Unreasonable Effectiveness of Recurrent Neural Networks. In: Andrej Karpathy blog [Internet]. 2015 [cited 22 Nov 2023]. Available: https://karpathy.github.io/2015/05/21/rnn-effectiveness/ ↩
-
Grossberg S. Recurrent neural networks. Scholarpedia 2013; 8: 1888. ↩
-
Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. 2012. http://arxiv.org/abs/1211.5063. ↩
-
Hochreiter S, Schmidhuber J. Long short-term memory. Neural computation 1997; 9: 1735–1780. ↩
Convolutional Neural Networks¶
Transformers ↵
Vanilla Transformers¶
In the seminal paper Attention is All You Need, the legendary transformer architecture was born3.
Quote from Attention Is All You Need
"... the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output."
The transformer has evolved a lot in the past few years, and there is a galaxy of variants4.

In this section, we will focus on the vanilla transformer. Jay Alammar wrote an excellent post, named The Illustrated Transformer1. We recommend the reader read the post. We won't cover everything in this section. However, for completeness, we will summarize some of the key ideas of transformers.
Formal Algorithms
For a formal description of the transformer-relevant algorithms, please refer to Phuong & Hutter (2022)5.
The Vanilla Transformer¶
In the vanilla transformer, we can find three key components: Encoder-Decoder, the attention mechanism, and the positional encoding.
Encoder-Decoder¶
It has an encoder-decoder architecture.

We assume that the input \(\mathbf X\) is already embedded and converted to tensors.
The encoder-decoder simulates the induction-deduction framework of learning. The input \(\mathbf X\) is first encoded into a representation \(\hat{\mathbf X}\) that should capture the minimal sufficient statistics of the input. Then the decoder takes this representation \(\hat{\mathbf X}\) and performs deduction to create the output \(\hat{\mathbf Y}\).
Attention¶
The key to a transformer is its attention mechanism. It utilizes the attention mechanism to look into the relations between the embeddings36. To understand the attention mechanism, we need to understand the query, the key, and the value. In essence, the attention mechanism is a classifier that outputs the usefulness of the elements in the value, where the usefulness is represented using a matrix formed from the query and the key,

$$
\mathrm{Attention}(\mathbf Q, \mathbf K, \mathbf V) = \mathrm{softmax}\left( \frac{\mathbf Q \mathbf K^T}{\sqrt{d_k}} \right) \mathbf V,
$$

where \(d_k\) is the dimension of the key \(\mathbf K\). For example, we can construct the query, key, and value by applying linear layers to the input \(\mathbf X\).
Conventions
We follow the convention that the first index of \(\mathbf X\) is the index of the input element. For example, if we have two words as our input, \(X_{0j}\) is the representation of the first word and \(X_{1j}\) is that of the second.
We also use Einstein notation in this section.
| Name | Definition | Component Form | Comment |
|---|---|---|---|
| Query \(\mathbf Q\) | \(\mathbf Q=\mathbf X \mathbf W^Q\) | \(Q_{ij} = X_{ik} W^{Q}_{kj}\) | Note that the weights \(\mathbf W^Q\) can be used to adjust the size of the query. |
| Key \(\mathbf K\) | \(\mathbf K=\mathbf X \mathbf W^K\) | \(K_{ij} = X_{ik} W^{K}_{kj}\) | In the vanilla scaled-dot attention, the dimension of key is the same as the query. This is why \(\mathbf Q \mathbf K^T\) works. |
| Value \(\mathbf V\) | \(\mathbf V = \mathbf X \mathbf W^V\) | \(V_{ij} = X_{ik} W^{V}_{kj}\) | |
The dot product \(\mathbf Q \mathbf K^T\) has components

$$
\left(\mathbf Q \mathbf K^T\right)_{ij} = Q_{ik} K_{jk},
$$

which, after scaling and softmax, becomes the matrix \(\mathbf A\) that determines how the elements in the value tensor are mixed, \(A_{ij}V_{jk}\). For an identity \(\mathbf A\), we do not mix the rows of \(\mathbf V\).
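Scaled dot-product attention is short enough to sketch in NumPy (the shapes here are arbitrary examples):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = k.shape[-1]
    a = softmax(q @ k.T / np.sqrt(d_k))  # mixing weights; rows sum to 1
    return a @ v, a

# Three input elements with 4-dimensional query/key and 2-dimensional value.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 2))

out, a = scaled_dot_product_attention(q, k, v)
print(a.sum(axis=1))  # each row of the mixing matrix sums to 1
```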
Classifier
The dot-product attention is like a classifier that outputs the usefulness of the elements in \(\mathbf V\). After training, \(\mathbf A\) should be able to make connections between the different input elements.
We will provide a detailed example when discussing the applications to time series.
Knowledge of Positions¶
Positional information, or time order information for time series input, is encoded by a positional encoder that shifts the embeddings. The simplest positional encoder uses the cyclic nature of trig functions3. By adding such positional information directly to the values before the data flows into the attention mechanism, we can encode the positional information into the attention mechanism2.
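A sketch of the sinusoidal positional encoder (the interleaved sine/cosine layout follows the vanilla transformer paper; `d_model` is assumed to be even):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the vanilla transformer paper."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# The encoding is added to the embeddings before the attention layers:
# x_with_position = x + pe
```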
-
Alammar J. The Illustrated Transformer. In: Jay Alammar [Internet]. 27 Jun 2018 [cited 14 Jun 2023]. Available: http://jalammar.github.io/illustrated-transformer/ ↩
-
Kazemnejad A. Transformer Architecture: The Positional Encoding. In: Amirhossein Kazemnejad’s Blog [Internet]. 20 Sep 2019 [cited 7 Nov 2023]. Available: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ ↩
-
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al. Attention is all you need. 2017. http://arxiv.org/abs/1706.03762. ↩↩↩
-
Amatriain X. Transformer models: An introduction and catalog. arXiv [csCL] 2023. doi:10.48550/ARXIV.2302.07730. ↩
-
Phuong M, Hutter M. Formal algorithms for transformers. 2022. doi:10.48550/ARXIV.2207.09238. ↩
-
Zhang A, Lipton ZC, Li M, Smola AJ. Dive into deep learning. arXiv preprint arXiv:210611342 2021. ↩
Ended: Transformers
Dynamical Systems ↵
Dynamical Systems¶
A lot of time series data are generated by dynamical systems. One of the most cited examples is the coordinates \(x(t)\), \(y(t)\), \(z(t)\) as functions of time \(t\) in a Lorenz system.
Lorenz System
A Lorenz system is defined by the Lorenz equations1

$$
\begin{align}
\frac{\mathrm d x}{\mathrm d t} &= \sigma (y - x), \\
\frac{\mathrm d y}{\mathrm d t} &= x (\rho - z) - y, \\
\frac{\mathrm d z}{\mathrm d t} &= x y - \beta z,
\end{align}
$$

where \(x\), \(y\), and \(z\) are the coordinates of a particle, and \(\sigma\), \(\rho\), and \(\beta\) are parameters of the system.
It is a chaotic system that is very sensitive to the initial conditions.
Dynamical Systems
Many real-world systems are dynamical systems. Differential equations are a handy tool to model them. For example, the action potentials of a squid giant axon can be modeled by the famous Hodgkin-Huxley model.
A naive philosophy to model time series is to come up with a set of differential equations to model the time series. However, finding clean and interpretable differential equations is not easy. It has been the top game in physics for centuries.
In the following sections, we will discuss a few solutions to model data as dynamical systems.
-
Wikipedia contributors. Lorenz system — Wikipedia, the free encyclopedia. 2023. https://en.wikipedia.org/w/index.php?title=Lorenz_system&oldid=1186188179. ↩
Neural ODE¶
Neural ODE is an elegant idea of combining neural networks and differential equations1. In this section, we will first show some examples of differential equations and then discuss how to combine neural networks and differential equations.
Ordinary Differential Equations¶
A first-order ordinary differential equation is as simple as

$$
\frac{\mathrm d h(t)}{\mathrm d t} = f(h(t), \theta(t), t), \label{eq:1st-order-ode}
$$

where \(h(t)\) is the function that describes the state of a dynamical system and \(\theta(t)\) denotes its parameters. To build up intuition, we show a few examples of differential equations below.
Examples of Differential Equations
Utility Code for the Following ODE (Run this first)
```python
from abc import ABC, abstractmethod
from typing import Any, Optional

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.integrate import odeint

sns.set()


class DiffEqn(ABC):
    """A first-order differential equation and the corresponding solutions.

    :param t: time steps to solve the differential equation for
    :param y_0: initial condition of the ODE
    """

    def __init__(self, t: np.ndarray, y_0: float, **fn_args: Optional[Any]):
        self.t = t
        self.y_0 = y_0
        self.fn_args = fn_args

    @abstractmethod
    def fn(self, y: float, t: np.ndarray) -> np.ndarray:
        """Right-hand side of the differential equation dy/dt = fn(y, t)."""

    def solve(self) -> np.ndarray:
        return odeint(self.fn, self.y_0, self.t)

    @abstractmethod
    def _formula(self) -> str:
        """LaTeX formula of the equation, used in plot titles."""

    def __str__(self) -> str:
        return self._formula()

    def __repr__(self) -> str:
        return self._formula()

    def plot(self, ax: Optional[plt.Axes] = None) -> None:
        if ax is None:
            _, ax = plt.subplots(figsize=(10, 6.18))
        sns.lineplot(x=self.t, y=self.solve()[:, 0], ax=ax)
        ax.set_xlabel("t")
        ax.set_ylabel("y")
        ax.set_title(f"Solution for ${self}$ with {self.fn_args}")
```
The logistic model of infectious disease is

$$
\frac{\mathrm d y}{\mathrm d t} = r \, y \, (1 - y).
$$
```python
class Infections(DiffEqn):
    def fn(self, y: float, t: np.ndarray) -> np.ndarray:
        r = self.fn_args["r"]
        return r * y * (1 - y)

    def _formula(self) -> str:
        return r"\frac{dh(t)}{d t} = r * y * (1-y)"


t = np.linspace(0, 10, 101)
infections_s = Infections(t, 0.1, r=0.9)
infections_s.plot()
```

The following equation describes an exponentially growing \(h(t)\),

$$
\frac{\mathrm d h(t)}{\mathrm d t} = \lambda h(t),
$$

with \(\lambda > 0\).
```python
class Exponential(DiffEqn):
    def fn(self, y: float, t: np.ndarray) -> np.ndarray:
        lbd = self.fn_args["lbd"]
        return lbd * y

    def _formula(self) -> str:
        return r"\frac{dh(t)}{d t} = \lambda h(t)"


y0_exponential = 1
t = np.linspace(0, 10, 101)
lbd = 2
exponential = Exponential(t, y0_exponential, lbd=lbd)
exponential.plot()
```

We construct an oscillatory system using a sinusoid,

$$
\frac{\mathrm d y}{\mathrm d t} = \sin(\lambda t) \, t.
$$

Naively, we expect the oscillations to be large for large \(t\): the envelope of the first-order derivative grows linearly with \(t\), so the oscillation amplitude grows without bound as \(t \to \infty\).
```python
class SinMultiplyT(DiffEqn):
    def fn(self, y: float, t: np.ndarray) -> np.ndarray:
        lbd = self.fn_args["lbd"]
        return np.sin(lbd * t) * t

    def _formula(self) -> str:
        return r"\frac{dh(t)}{d t} = \sin(\lambda t) t"


y0_sin = 1
t = np.linspace(0, 10, 101)
lbd = 2
sin_multiply_t = SinMultiplyT(t, y0_sin, lbd=lbd)
sin_multiply_t.plot()
```

We design a system that grows according to the reciprocal of its value,

$$
\frac{\mathrm d y}{\mathrm d t} = \frac{1}{\mathrm{shift} + \mathrm{scale} \cdot y}.
$$
```python
class Reciprocal(DiffEqn):
    def fn(self, y: float, t: np.ndarray) -> np.ndarray:
        shift = self.fn_args["shift"]
        scale = self.fn_args["scale"]
        return 1 / (shift + scale * y)

    def _formula(self) -> str:
        return r"\frac{dh(t)}{d t} = \frac{1}{shift + scale * y}"


reciprocal = Reciprocal(t, 1, shift=-5, scale=-10)
reciprocal.plot()
```

Finite Difference Form of Differential Equations¶
Eq. \(\eqref{eq:1st-order-ode}\) can be rewritten in the finite difference form as

$$
h(t + \Delta t) = h(t) + \Delta t \, f(h(t), \theta(t), t),
$$

with \(\Delta t\) small enough.
Derivatives
The definition of the first-order derivative is $$ h'(t) = \lim_{\Delta t\to 0} \frac{h(t+\Delta t) - h(t)}{\Delta t}. $$
In a numerical computation, \(\lim\) is approached by taking a small \(\Delta t\).
If we take \(\Delta t = 1\), the equation becomes

$$
h(t + 1) = h(t) + f(h(t), \theta(t), t). \label{eq:1st-order-ode-finite-difference-deltat-1}
$$
\(\Delta t = 1\)? Rescale of Time \(t\).
As there is no absolute measure of time \(t\), we can always rescale \(t\) to \(\hat t\) so that \(\Delta t = 1\). However, for the sake of clarity, we will keep \(t\) as the time variable.
Coming back to neural networks, if \(h(t)\) represents the state of a neural network block at depth \(t\), Eq. \(\eqref{eq:1st-order-ode-finite-difference-deltat-1}\) is nothing fancy but the residual connection in ResNet2. This connection between the finite difference form of a first-order differential equation and the residual connection leads to a new family of deep neural network models called neural ODEs1.
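The equivalence is easy to see in code: one Euler step of the ODE with \(\Delta t = 1\) is one residual update (a toy sketch with fixed dynamics \(f(h) = -0.5\,h\) standing in for a learned network):

```python
import numpy as np

def f(h):
    """Stand-in for a learned residual block's dynamics."""
    return -0.5 * h

def euler_step(h, dt=1.0):
    # With dt = 1, this is exactly the residual connection h_{t+1} = h_t + f(h_t).
    return h + dt * f(h)

h = np.array([1.0])
for _ in range(5):  # five Euler steps = five residual blocks
    h = euler_step(h)
print(h)  # approaches 0, the fixed point of dh/dt = -0.5 h
```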
In a neural ODE, we treat each layer of a neural network as a function \(h\) of depth \(t\), i.e., the state of the layer is the function value \(h(t)\). However, we are not obliged to use \(h\) to directly take in the raw input and produce the raw output. Neural ODEs are extremely flexible, and we can build latent dynamical systems to represent some intrinsic dynamics.

Solving Differential Equations
The right-hand side of Eq. \(\eqref{eq:1st-order-ode}\), \(f(h(t), \theta(t), t)\), describes the dynamics. Given any such \(f\), we can, at least numerically, perform integration to find the value of \(h(t)\) at any \(t = t_N\),

$$
h(t_N) = h(t_0) + \int_{t_0}^{t_N} f(h(t), \theta(t), t) \, \mathrm d t.
$$

In this formalism, we find the transformed output \(h(t_N)\) from the input \(h(t_0)\) by solving this differential equation. In traditional neural networks, we achieve this by stacking many layers using skip connections.
In the original Neural ODE paper, the authors used the so-called reverse-mode derivative (adjoint sensitivity) method1.
We will not dive deep into solving differential equations in this section. Instead, we will show some applications of neural ODEs in section Time Series Forecasting with Neural ODE.
-
Chen RTQ, Rubanova Y, Bettencourt J, Duvenaud D. Neural ordinary differential equations. 2018. http://arxiv.org/abs/1806.07366. ↩↩↩
-
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2015. Available: http://arxiv.org/abs/1512.03385 ↩
Ended: Dynamical Systems
Energy-based Models ↵
Energy-based Models¶
Energy-based models (EBM) establish relations between different possible values of variables using "energy functions"2. In an EBM, any input data point can be assigned a probability density1. As in statistical physics, we can construct such probability densities using a partition function, which requires a scalar function analogous to the energy function in statistical physics. When building the objective function, we require configurations that should share the same target label to have low energy, i.e., higher probability density: they should be compatible.
-
Lippe P. Tutorial 9: Deep Autoencoders — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet]. [cited 20 Sep 2021]. Available: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial9/AE_CIFAR10.html ↩
-
Le Cun Y, Chopra S, Hadsell R, Ranzato M, Huang FJ. A tutorial on Energy-Based learning. 2006. ↩
Denoising Diffusion Probabilistic Models¶
Many philosophically beautiful deep learning ideas face the tractability problem. Many deep learning models utilize the concept of latent space, e.g., \(\mathbf z\), which is usually a compression of the real data space, e.g., \(\mathbf x\), to enable easier computations for our task.
However, such models usually require the computation of an intractable marginalization of the joint distribution \(p(\mathbf x, \mathbf z)\) over the latent space3. To make such computations tractable, we have to apply approximations or theoretical assumptions. Diffusion models in deep learning establish the connection between the real data space \(\mathbf x\) and the latent space \(\mathbf z\) assuming invertible diffusion processes 4 5.
Objective¶
In a denoising diffusion model, given an input \(\mathbf x^0\) drawn from a complicated and unknown distribution \(q(\mathbf x^0)\), we find
- a latent space with a simple and manageable distribution, e.g., normal distribution, and
- the transformations from \(\mathbf x^0\) to \(\mathbf x^N\), as well as
- the transformations from \(\mathbf x^N\) to \(\mathbf x^0\).
Image Data Example
The following figure is taken from Sohl-Dickstein et al. (2015)5.

The forward process, shown in the first row, diffuses the original spiral data at \(t=0\) into Gaussian noise at \(t=T\). The reverse process, shown in the second row, recovers the original data at \(t=0\) from the noise at \(t=T\).
In the following text, we use \(n\) instead of \(t\).
An Example with \(N=5\)¶
For example, with \(N=5\), the forward process is
flowchart LR
x0 --> x1 --> x2 --> x3 --> x4 --> x5
and the reverse process is
flowchart LR
x5 --> x4 --> x3 --> x2 --> x1 --> x0
The joint distribution we are searching for is
A diffusion model assumes a simple diffusion process, e.g.,
This simulates an information diffusion process. The information in the original data is gradually smeared.
If the chosen diffusion process is reversible, the reverse process of it can be modeled by a similar Markov process
This reverse process is the denoising process.
As long as our model estimates \(p_\theta (\mathbf x^{n-1} \vert \mathbf x^{n})\) well, we can go \(\mathbf x^0 \to \mathbf x^N\) and \(\mathbf x^N \to \mathbf x^0\).
The Reverse Process: A Gaussian Example¶
With Eq \(\eqref{eq-guassian-noise}\), the reverse process is
Summary¶
- Forward: perturbs data to noise, step by step;
- Reverse: converts noise to data, step by step.
flowchart LR
prior["prior distribution"]
data --"forward Markov chain"--> noise
noise --"reverse Markov chain"--> data
prior --"sampling"--> noise
Optimization¶
The forward chain is predefined. To close the loop, we have to find \(p_\theta\). A natural choice for our loss function is the negative log-likelihood,
Ho et al. (2020) proved that the above loss has an upper bound related to the diffusion process defined in Eq \(\eqref{eq-guassian-noise}\)1
where \(\epsilon\) is a sample from \(\mathcal N(0, \mathbf I)\). The second step assumes the Gaussian noise in Eq \(\eqref{eq-guassian-noise}\), which is equivalent to1
with \(\alpha_n = 1 - \beta _ n\), \(\bar \alpha _ n = \Pi _ {i=1}^n \alpha_i\), and \(\Sigma_\theta\) in Eq \(\eqref{eqn-guassian-reverse-process}\).
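The closed form above lets us sample \(\mathbf x^n\) directly from \(\mathbf x^0\) without simulating the chain step by step. A minimal sketch, assuming the common linear \(\beta\) schedule (the schedule choice is ours, not prescribed by the text):

```python
import torch

# Linear beta schedule (an illustrative, commonly used choice)
N = 1000
beta = torch.linspace(1e-4, 0.02, N)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # \bar\alpha_n = \prod_{i=1}^n \alpha_i


def q_sample(x0: torch.Tensor, n: int, eps: torch.Tensor) -> torch.Tensor:
    """Sample x^n in one shot: x^n = sqrt(abar_n) x^0 + sqrt(1 - abar_n) eps."""
    return alpha_bar[n].sqrt() * x0 + (1 - alpha_bar[n]).sqrt() * eps


x0 = torch.randn(8, 2)      # toy "data"
eps = torch.randn_like(x0)  # noise sample from N(0, I)
xn = q_sample(x0, n=999, eps=eps)
# A network eps_theta(xn, n) would be trained with an MSE loss against eps.
# After many steps, x^n is almost pure noise: alpha_bar becomes tiny.
print(float(alpha_bar[999]))
```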
Code¶
Rogge & Rasul (2022) wrote a post with detailed annotations of the denoising diffusion probabilistic model2.
-
Rasul K, Seward C, Schuster I, Vollgraf R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2101.12072 ↩↩
-
Rogge N, Rasul K. The Annotated Diffusion Model. In: Hugging Face Blog [Internet]. 7 Jun 2022 [cited 18 Feb 2023]. Available: https://huggingface.co/blog/annotated-diffusion ↩
-
Luo C. Understanding diffusion models: A unified perspective. 2022. Available: http://arxiv.org/abs/2208.11970 ↩
-
Sohl-Dickstein J, Weiss EA, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. 2015. Available: http://arxiv.org/abs/1503.03585 ↩
-
Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. 2020. Available: http://arxiv.org/abs/2006.11239 ↩↩
Ended: Energy-based Models
Generative Models ↵
Generative Models¶

Generative models come with
- an encoder,
- an explicit latent space, and
- a decoder.
Autoregressive Model¶

An autoregressive (AR) model predicts each value of a sequence from the values that precede it,
Notations and Conventions
In AR models, we have to mention the preceding nodes (\(\{x_{<t}\}\)) of a specific node (\(x_{t}\)). For \(t=5\), the relations between \(\{x_{<5}\}\) and \(x_5\) are shown in the following illustration.

There are different notations for such relations.
- In Uria et al., the authors use \(p(x_{o_d}\mid \mathbf x_{o_{<d}})\) 1.
- In Liu et al. and Papamakarios et al., the authors use \(p(x_{t}\mid \mathbf x_{1:t-1})\) 64.
- In Germain et al., the authors use \(p(x_t\mid \mathbf x_{<t})\) 5.
In the current review, we expand the vector notation \(\mathbf x_{<t}\) into a set notation, as the preceding values do not necessarily form a vector.
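To make the factorization concrete, here is a toy sketch: the log-likelihood of a Gaussian AR(1) model decomposes into a sum of conditionals \(p(x_t\mid \{x_{<t}\})\). The parameters `phi` and `sigma` are illustrative, not from the text.

```python
import numpy as np


def ar_log_likelihood(x: np.ndarray, phi: float = 0.5, sigma: float = 1.0) -> float:
    """log p(x) = sum_t log p(x_t | x_{<t}) for a toy Gaussian AR(1) model.

    Here p(x_t | x_{<t}) = N(phi * x_{t-1}, sigma^2) and p(x_0) = N(0, sigma^2).
    """
    mu = np.concatenate(([0.0], phi * x[:-1]))  # conditional mean of each step
    log_probs = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    return float(log_probs.sum())


x = np.array([0.1, 0.2, 0.05])
print(ar_log_likelihood(x))
```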
-
Uria B, Côté M-A, Gregor K, Murray I, Larochelle H. Neural Autoregressive Distribution Estimation. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1605.02226 ↩
-
Triebe O, Laptev N, Rajagopal R. AR-Net: A simple Auto-Regressive Neural Network for time-series. arXiv [cs.LG]. 2019. Available: http://arxiv.org/abs/1911.12436 ↩
-
Ho G. George Ho. In: Eigenfoo [Internet]. 9 Mar 2019 [cited 19 Sep 2021]. Available: https://www.eigenfoo.xyz/deep-autoregressive-models/ ↩
-
Papamakarios G, Pavlakou T, Murray I. Masked Autoregressive Flow for Density Estimation. arXiv [stat.ML]. 2017. Available: http://arxiv.org/abs/1705.07057 ↩
-
Germain M, Gregor K, Murray I, Larochelle H. MADE: Masked autoencoder for distribution estimation. 32nd International Conference on Machine Learning, ICML 2015. 2015;2: 881–889. Available: http://arxiv.org/abs/1502.03509 ↩
-
Liu X, Zhang F, Hou Z, Wang Z, Mian L, Zhang J, et al. Self-supervised Learning: Generative or Contrastive. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2006.08218 ↩
-
Lippe P. Tutorial 12: Autoregressive Image Modeling — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet]. [cited 20 Sep 2021]. Available: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial12/Autoregressive_Image_Modeling.html ↩
-
rogen-george. rogen-george/Deep-Autoregressive-Model. In: GitHub [Internet]. [cited 20 Sep 2021]. Available: https://github.com/rogen-george/Deep-Autoregressive-Model ↩
Autoencoders¶
Autoencoders (AE) are machines that encode inputs into a compact latent space.

Notation: dot (\(\cdot\))
We use a single vertically centered dot, i.e., \(\cdot\), to indicate that the function or machine can take in arguments.
A simple autoencoder can be achieved using two neural nets, e.g.,
where in this simple example,
- \({\color{blue}g(b + w \cdot )}\) is the encoder, and
- \({\color{red}\sigma(c + v \cdot )}\) is the decoder.
For binary labels, we can use a simple cross entropy as the loss.
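A minimal sketch of this two-net autoencoder, using illustrative dimensions and binary cross entropy as the loss:

```python
import torch
from torch import nn

# A minimal autoencoder matching the two-net form in the text;
# the dimensions (8 -> 2 -> 8) are illustrative.
encoder = nn.Sequential(nn.Linear(8, 2), nn.Tanh())     # g(b + w ·)
decoder = nn.Sequential(nn.Linear(2, 8), nn.Sigmoid())  # sigma(c + v ·)
loss_fn = nn.BCELoss()  # cross entropy for binary labels

x = torch.randint(0, 2, (16, 8)).float()  # toy binary inputs
x_hat = decoder(encoder(x))               # reconstruction
loss = loss_fn(x_hat, x)
loss.backward()  # gradients flow through both nets
print(x_hat.shape)
```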
Code¶
See Lippe1.
-
Lippe P. Tutorial 9: Deep Autoencoders — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet]. [cited 20 Sep 2021]. Available: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial9/AE_CIFAR10.html ↩
Variational AutoEncoder¶
A Variational AutoEncoder (VAE) is very different from an AE. In a VAE, we introduce a variational distribution \(q\) to help us work out the weighted integral that appears after introducing the latent space variable \(z\),
In the above derivation,
- \({}_\theta\) is the model for inference, and
- \({}_\phi\) is the model for variational approximation.
Tricks
- \(p_\theta(x\mid z)\) is usually a Gaussian distribution of \(x\), with the mean parameterized by the latent variable \(z\) and the model parameters \(\theta\).
- The latent space variable \(p(z)\) is usually assumed to be a normal distribution.
- The marginalization of the latent variable increases the expressive power.
- Instead of modeling a complex likelihood \(p(x\mid z)\) directly, we only need to model parameters of Gaussian distributions, e.g., a function \(f(z, \theta)\) for the mean of the Gaussian distribution.

From a simple distribution in the latent space to a more complex distribution [Doersch2016]
The demo looks great. However, sampling from the latent space becomes more difficult as its dimension increases, so we need a more efficient way to sample. The variational method introduces a model \(q(z\mid x)\) that samples \(z\) conditioned on \(x\) to help us with sampling.
In the derivation, we used \(\int dz q(z\mid x) = 1\).
The term \(F(x)\) is the free energy, while the negative of it, \(-F(x)=\mathcal L\), is the so-called Evidence Lower Bound (ELBO),
We also dropped the term \(D_{\mathrm{KL}}( q(z\mid x)\parallel p(z\mid x) )\), which is always non-negative. We cannot optimize this KL divergence directly, since we do not know \(p(z\mid x)\). But because it is non-negative, finding a \(q\) that maximizes \(\mathcal L\) simultaneously pushes \(q(z\mid x)\) toward \(p(z\mid x)\) and maximizes the log-likelihood. So we only need to find a way to maximize \(\mathcal L\).
More about this ELBO
We do not know \(p(x,z)\) either but we can rewrite \(\mathcal L\),
Our loss function becomes
where \({\color{blue}q(z\mid x) }\) is our encoder which encodes data \(x\) to the latent data \(z\), and \({\color{red}p(x\mid z)}\) is our decoder. The second term ensures our encoder is similar to our priors.
Using Neural networks¶
We model the parameters of the Gaussian distribution \(p_\theta(x\mid z)\), e.g., \(f(z, \theta)\), using a neural network.
In practice, we choose a Gaussian form of the variational distribution, with the mean and variance depending on the data \(x\)
We have
Why don't we simply draw \(q\) from \(p(z)\)?
If we are in effect also minimizing the KL divergence \(\operatorname{KL} \left( {\color{blue}q(z\mid x) }\parallel p(z) \right)\), why don't we simply draw \(q\) from \(p(z)\)? First, we also have to take care of the reconstruction term. Second, we need a latent space that connects to the actual data for reconstruction.
Structure¶

Doersch wrote a very nice tutorial on VAE1, in which we can find the detailed structure of VAE.
Another key component of VAE is the reparametrization trick. The variational approximation \(q_\phi\) is usually a Gaussian distribution. Once we get the parameters for the Gaussian distribution, we will have to sample from the Gaussian distribution based on the parameters. However, this sampling process prohibits us from propagating errors. The reparametrization trick solves this problem.
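A sketch of the trick: instead of sampling \(z \sim \mathcal N(\mu, \sigma^2)\) directly, we sample \(\epsilon \sim \mathcal N(0, \mathbf I)\) and compute \(z = \mu + \sigma \odot \epsilon\), so gradients flow into \(\mu\) and \(\sigma\) (names and shapes below are illustrative):

```python
import torch


def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps


mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()  # errors now propagate through the sampling step
print(mu.grad is not None)  # True
```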
Loss Explanation¶

VAE Loss Explained 1
Code¶
See Lippe2.
-
Doersch C. Tutorial on Variational Autoencoders. arXiv [stat.ML]. 2016. Available: http://arxiv.org/abs/1606.05908 ↩↩
-
Lippe P. Tutorial 9: Deep Autoencoders — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet]. [cited 20 Sep 2021]. Available: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial9/AE_CIFAR10.html ↩
Flow¶

For a probability density \(p(x)\) and a transformation of coordinate \(x=g(z)\) or \(z=f(x)\), the density can be expressed using the coordinate transformations, i.e.,
where the Jacobian is
The operation \(g_{*}\circ \tilde p(z)\) is the pushforward of \(\tilde p(z)\): \(g_{*}\) pushes the simple distribution \(\tilde p(z)\) forward to a more complex distribution \(p(x)\).
- The generative direction: sample \(z\) from distribution \(\tilde p(z)\), apply transformation \(g(z)\);
- The normalizing direction: "simplify" \(p(x)\) to some simple distribution \(\tilde p(z)\).
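The change-of-variables rule can be checked numerically. A sketch with a one-dimensional affine flow \(x = g(z) = a z + b\) pushing forward a standard normal (the values of \(a\) and \(b\) are illustrative):

```python
import numpy as np

# Affine flow x = g(z) = a * z + b, so z = f(x) = (x - b) / a
a, b = 2.0, 1.0


def log_prob_x(x: np.ndarray) -> np.ndarray:
    """log p(x) = log p~(f(x)) + log |df/dx| for a standard normal base p~(z)."""
    z = (x - b) / a
    log_base = -0.5 * np.log(2 * np.pi) - 0.5 * z**2  # log of standard normal density
    return log_base + np.log(abs(1.0 / a))            # |df/dx| = 1/|a|


x = np.array([1.0])   # corresponds to z = 0, the mode of the base distribution
print(log_prob_x(x))  # log density of N(b, a^2) at its mean
```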
The key to the flow model is the chaining of the transformations
where
-
Liu X, Zhang F, Hou Z, Wang Z, Mian L, Zhang J, et al. Self-supervised Learning: Generative or Contrastive. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2006.08218 ↩
GAN¶
GAN is a generative neural sampler1. To train the sampler, the task of GAN is to generate features \(X\) from a latent space \(\xi\) and class labels \(Y\),
Many different formulations of GANs have been proposed. As an introduction to this topic, we discuss the vanilla GAN in this section3.
GAN Theory¶
The Minimax Game Loss¶
The minimax game is a game of "minimizing the possible loss for a worst case"2. In GAN, the game is to train the generator \({\color{red}G}\) to fool the discriminator \({\color{green}D}\) while minimizing the discrimination error of \({\color{green}D}\).
Goodfellow et al. proposed the loss3
Divergence¶
Goodfellow et al. proved that the global minimum of this setup is reached if and only if \(p_{G} = p_\text{data}\). GAN compares the generated distribution to the data distribution using the Jensen-Shannon divergence3,
Off by a Constant
The value function of GAN for a fixed \(G\) differs from the JS divergence only by a constant3,
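For intuition, the JS divergence between two discrete distributions can be computed directly (a quick sketch, not part of the GAN training loop):

```python
import numpy as np


def kl(p: np.ndarray, q: np.ndarray) -> float:
    """Kullback-Leibler divergence for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def js(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence: symmetric in p, q and bounded by log 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(js(p, q), js(p, p))  # positive for p != q; 0.0 for identical distributions
```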
Alternating Training¶
GAN training requires two stages,
- train discriminator \({\color{green}D}\), and
- train generator \({\color{red}G}\).

GAN Code¶
We built a simple GAN on the MNIST dataset.
The generated images look quite close to handwritten digits.

import matplotlib.pyplot as plt
import torch
from pathlib import Path
import torchvision
import torchvision.transforms as transforms
from loguru import logger
from torch import nn
import click

logger.debug("Setting device ...")
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
logger.info(f"Device in use: {device}")


def plot_images(image_samples, target):
    """Plot a grid of images and save to a file."""
    if not Path(target).parent.exists():
        Path(target).parent.mkdir(parents=True)
    for i in range(16):
        plt.subplot(4, 4, i + 1)
        plt.imshow(image_samples[i].reshape(28, 28), cmap="gray_r")
        plt.xticks([])
        plt.yticks([])
    plt.savefig(target)


def get_data_loaders(batch_size=32, data_dir="data/mnist", download=True, plot_samples=True):
    """Get MNIST data and build a dataloader for the dataset"""
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
    )
    train_set = torchvision.datasets.MNIST(
        root=data_dir, train=True, download=download, transform=transform
    )
    # drop_last=True keeps every batch at batch_size, so the label tensors
    # created in the training loop always match the sample tensors
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True, drop_last=True
    )
    if plot_samples:
        real_samples, mnist_labels = next(iter(train_loader))
        plot_images(real_samples, target="assets/real_images/real_image_samples.png")
    return train_loader


class Discriminator(nn.Module):
    """The discriminator takes data with the dimension of the image and outputs a probability"""

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x.view(x.size(0), 784)
        output = self.model(x)
        return output


class Generator(nn.Module):
    """The generator takes in noise (a latent space sample) and outputs an image.

    We use the input noise as a trick to make the generator more general.
    """

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 784),
            nn.Tanh(),
        )

    def forward(self, x):
        output = self.model(x)
        output = output.view(x.size(0), 1, 28, 28)
        return output


@click.command()
@click.option("--epochs", default=50, help="Number of epochs for the training")
@click.option("--learning_rate", "-lr", default=0.0001, help="Learning rate for the optimizer")
@click.option("--batch_size", default=32, help="Batch size")
@click.option("--data_dir", default="data/mnist", help="Directory for storing the dataset")
@click.option("--download_mnist", "-d", default=True, type=bool, help="Whether to download MNIST data")
@click.option("--random_seed", "-rs", default=42, type=int, help="Random seed for the random generators")
def main(epochs, learning_rate, batch_size, data_dir, download_mnist, random_seed):
    latent_space_dim = 100
    torch.manual_seed(random_seed)
    # check the dtypes
    logger.debug(f"torch tensor dtype: {torch.tensor([1.2, 3]).dtype}")
    train_loader = get_data_loaders(
        batch_size=batch_size, data_dir=data_dir, download=download_mnist
    )
    logger.debug("Training data is ready")
    discriminator = Discriminator().to(device=device)
    generator = Generator().to(device=device)
    loss_function = nn.BCELoss()
    optimizer_discriminator = torch.optim.Adam(
        discriminator.parameters(), lr=learning_rate
    )
    optimizer_generator = torch.optim.Adam(generator.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        for n, (real_samples, mnist_labels) in enumerate(train_loader):
            # Prepare data for training the discriminator:
            # both the generated samples and the real samples
            real_samples = real_samples.to(device=device)
            real_samples_labels = torch.ones((batch_size, 1)).to(device=device)
            latent_space_samples = torch.randn((batch_size, latent_space_dim)).to(device=device)
            generated_samples = generator(latent_space_samples)
            generated_samples_labels = torch.zeros((batch_size, 1)).to(device=device)
            all_samples = torch.cat((real_samples, generated_samples))
            all_samples_labels = torch.cat((real_samples_labels, generated_samples_labels))
            # Train the discriminator on the generated samples and the real images
            discriminator.zero_grad()
            output_discriminator = discriminator(all_samples)
            loss_discriminator = loss_function(output_discriminator, all_samples_labels)
            loss_discriminator.backward()
            optimizer_discriminator.step()
            # Generate noise data for training the generator
            latent_space_samples_generator = torch.randn((batch_size, latent_space_dim)).to(device=device)
            # Train the generator
            generator.zero_grad()
            generated_samples_generator = generator(latent_space_samples_generator)
            output_discriminator_generated = discriminator(generated_samples_generator)
            loss_generator = loss_function(
                output_discriminator_generated, real_samples_labels
            )
            loss_generator.backward()
            optimizer_generator.step()
            # Show the losses and plot samples at the end of each epoch
            # (comparing n with batch_size - 1 only works by coincidence;
            # the last batch index is len(train_loader) - 1)
            if n == len(train_loader) - 1:
                print(f"Epoch: {epoch} Loss D.: {loss_discriminator}")
                print(f"Epoch: {epoch} Loss G.: {loss_generator}")
                logger.debug(f"Plotting for epoch: {epoch} ...")
                latent_space_samples_epoch = torch.randn(batch_size, latent_space_dim).to(device=device)
                generated_samples_epoch = generator(latent_space_samples_epoch)
                generated_samples_epoch = generated_samples_epoch.cpu().detach()
                plot_images(generated_samples_epoch, target=f"assets/generated_images/generated_image_samples_{epoch}.png")
                logger.debug(f"Saved plots for epoch: {epoch}")
    latent_space_samples = torch.randn(batch_size, latent_space_dim).to(device=device)
    generated_samples = generator(latent_space_samples)
    logger.debug("Plot generated images...")
    generated_samples = generated_samples.cpu().detach()
    plot_images(generated_samples, target="assets/generated_images/generated_image_samples.png")


if __name__ == "__main__":
    main()
f-GAN¶
The essence of GAN is comparing the generated distribution \(p_G\) and the data distribution \(p_\text{data}\). The vanilla GAN considers the Jensen-Shannon divergence \(\operatorname{D}_\text{JS}(p_\text{data}\Vert p_{G})\). The discriminator \({\color{green}D}\) serves the purpose of forcing this divergence to be small.
Why do we need the discriminator?
If the JS divergence is the objective, why do we need the discriminator? Even in f-GAN, we need a functional to approximate the f-divergence, and this functional plays the same role as the discriminator in GAN.
There exists a more general form of the JS divergence, called the f-divergence6. f-GAN trains the model by estimating the f-divergence between the data distribution and the generated distribution1.
Variational Divergence Minimization¶
Variational Divergence Minimization (VDM) extends the variational estimation of f-divergences1. VDM searches for the saddle point of an objective \(F({\color{red}\theta}, {\color{blue}\omega})\), i.e., a minimum w.r.t. \({\color{red}\theta}\) and a maximum w.r.t. \({\color{blue}\omega}\), where \({\color{red}\theta}\) is the parameter set of the generator \({\color{red}Q_\theta}\) and \({\color{blue}\omega}\) is the parameter set of the variational function \({\color{blue}T_\omega}\) used to estimate the f-divergence.
The objective \(F({\color{red}\theta}, {\color{blue}\omega})\) is related to the choice of \(f\) in f-divergence and the variational functional \({\color{blue}T}\),
In the above objective,
- \(f^*\) is the Legendre–Fenchel transformation of \(f\), i.e., \(f^*(t) = \operatorname{sup}_{u\in \mathrm{dom}_f}\left\{ ut - f(u) \right\}\).
\(T\)
The function \(T\) is used to estimate the lower bound of f-divergence1.
We estimate
- \(\mathbb E_{x\sim p_\text{data}}\) by sampling from the mini-batch, and
- \(\mathbb E_{x\sim {\color{red}Q_\theta} }\) by sampling from the generator.
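A sketch of this Monte Carlo estimate for the KL case \(f(u) = u\log u\), whose conjugate is \(f^*(t) = e^{t-1}\). Here \(T\) is a fixed illustrative function rather than a learned \({\color{blue}T_\omega}\), and the two sample sets stand in for the mini-batch and the generator output:

```python
import numpy as np

rng = np.random.default_rng(0)


def f_star(t: np.ndarray) -> np.ndarray:
    """Fenchel conjugate of f(u) = u log u (the KL case): f*(t) = exp(t - 1)."""
    return np.exp(t - 1.0)


def T(x: np.ndarray) -> np.ndarray:
    """A fixed illustrative variational function; in f-GAN, T_omega is learned."""
    return 0.5 * x


x_data = rng.normal(1.0, 1.0, size=10_000)  # samples standing in for p_data
x_gen = rng.normal(0.0, 1.0, size=10_000)   # samples standing in for Q_theta
# Lower bound on D_f(p_data || Q_theta): E_P[T(x)] - E_Q[f*(T(x))]
bound = T(x_data).mean() - f_star(T(x_gen)).mean()
print(bound)  # below the true KL(N(1,1) || N(0,1)) = 0.5, as a lower bound must be
```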
Reduce to GAN
The VDM loss can be reduced to the loss of GAN by setting1
It is straightforward to validate that the following result is a solution to the above set of equations,
Code¶
InfoGAN¶
In GAN, the latent space input is usually random noise, e.g., Gaussian noise. The objective of GAN is very generic: it says nothing about how exactly the latent space will be used. This is undesirable in many problems where we would like more interpretability in the latent space. InfoGAN introduces a constraint into the objective to enforce the interpretability of the latent space8.
Constraint¶
The constraint InfoGAN proposed is mutual information,
where
- \(c\) is the latent code,
- \(z\) is the random noise input,
- \(V({\color{green}D}, {\color{red}G})\) is the objective of GAN,
- \(I(c; {\color{red}G}(z,c))\) is the mutual information between the input latent code and generated data.
With the multiplier \(\lambda\), we penalize the model if the generator loses the information carried by the latent code \(c\).
Training¶

The training steps are almost the same as GAN but with one extra loss to be calculated in each mini-batch.
- Train \(\color{red}G\) using loss: \(\operatorname{MSE}(v', v)\);
- Train \(\color{green}D\) using loss: \(\operatorname{MSE}(v', v)\);
- Apply Constraint:
- Sample data from mini-batch;
- Calculate loss \(\lambda_{l} H(l';l)+\lambda_c \operatorname{MSE}(c,c')\)
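The constraint loss in the last step can be sketched as follows; the auxiliary head that predicts the codes back from generated data, the code dimensions, and the weights \(\lambda_l, \lambda_c\) below are all illustrative:

```python
import torch
from torch import nn

# Illustrative InfoGAN-style constraint: an auxiliary head predicts the
# latent codes back from the generated data.
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
lambda_l, lambda_c = 1.0, 0.1  # weights for the categorical / continuous codes

logits_pred = torch.randn(8, 10)          # predicted categorical code l'
labels_true = torch.randint(0, 10, (8,))  # categorical code l fed to G
c_pred = torch.randn(8, 2)                # predicted continuous code c'
c_true = torch.randn(8, 2)                # continuous code c fed to G

# lambda_l * H(l'; l) + lambda_c * MSE(c, c'), as in the training steps above
info_loss = lambda_l * ce(logits_pred, labels_true) + lambda_c * mse(c_pred, c_true)
# This term is added to the generator's adversarial loss.
print(float(info_loss))
```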
Python Code¶
-
Nowozin S, Cseke B, Tomioka R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv [stat.ML]. 2016. Available: http://arxiv.org/abs/1606.00709 ↩↩↩↩↩↩
-
Contributors to Wikimedia projects. Minimax. In: Wikipedia [Internet]. 5 Aug 2021 [cited 6 Sep 2021]. Available: https://en.wikipedia.org/wiki/Minimax ↩
-
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Networks. arXiv [stat.ML]. 2014. Available: http://arxiv.org/abs/1406.2661 ↩↩↩↩
-
Liu X, Zhang F, Hou Z, Wang Z, Mian L, Zhang J, et al. Self-supervised Learning: Generative or Contrastive. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2006.08218 ↩
-
Arjovsky M, Chintala S, Bottou L. Wasserstein GAN. arXiv [stat.ML]. 2017. Available: http://arxiv.org/abs/1701.07875 ↩
-
Contributors to Wikimedia projects. F-divergence. In: Wikipedia [Internet]. 17 Jul 2021 [cited 6 Sep 2021]. Available: https://en.wikipedia.org/wiki/F-divergence#Instances_of_f-divergences ↩
-
Contributors to Wikimedia projects. Convex conjugate. In: Wikipedia [Internet]. 20 Feb 2021 [cited 7 Sep 2021]. Available: https://en.wikipedia.org/wiki/Convex_conjugate ↩
-
Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1606.03657 ↩
-
Agakov DBF. The im algorithm: a variational approach to information maximization. Adv Neural Inf Process Syst. 2004. Available: https://books.google.com/books?hl=en&lr=&id=0F-9C7K8fQ8C&oi=fnd&pg=PA201&dq=Algorithm+variational+approach+Information+Maximization+Barber+Agakov&ots=TJGrkVS610&sig=yTKM2ZdcZQBTY4e5Vqk42ayUDxo ↩
Ended: Generative Models
Ended: Fundamentals of Deep Learning
Time Series Forecasting with Deep Learning ↵
Time Series Forecasting with Deep Learning¶
In the chapter Deep Learning Fundamentals, we discussed some deep learning models. In this chapter, we will discuss how to apply deep learning models to time series forecasting problems.
Creating Dataset for Deep Learning Models¶
Deep learning models usually require batches of data to train. For time series data, we need to slice along the time axis to create batches. In section The Time Delay Embedding Representation, we discussed methods to represent time series data. In this section, we provide an example.
In our ts_dl_utils package, we provide a class called DataFrameDataset. This class moves along the time axis and cuts the time series into multiple data points.
from typing import Tuple

import numpy as np
import pandas as pd
from loguru import logger
from torch.utils.data import Dataset


class DataFrameDataset(Dataset):
    """A dataset from a pandas dataframe.

    For a given pandas dataframe, this generates a pytorch
    compatible dataset by sliding in time dimension.

    ```python
    ds = DataFrameDataset(
        dataframe=df, history_length=10, horizon=2
    )
    ```

    :param dataframe: input dataframe with a DatetimeIndex.
    :param history_length: length of input X in time dimension
        in the final Dataset class.
    :param horizon: number of steps to be forecasted.
    :param gap: gap between input history and prediction
    """

    def __init__(
        self, dataframe: pd.DataFrame, history_length: int, horizon: int, gap: int = 0
    ):
        super().__init__()
        self.dataframe = dataframe
        self.history_length = history_length
        self.horizon = horizon
        self.gap = gap
        self.dataframe_rows = len(self.dataframe)
        self.length = (
            self.dataframe_rows - self.history_length - self.horizon - self.gap + 1
        )
        self._validate_dataframe()

    def moving_slicing(self, idx: int, gap: int = 0) -> Tuple[np.ndarray, np.ndarray]:
        x, y = (
            self.dataframe[idx : self.history_length + idx].values,
            self.dataframe[
                self.history_length
                + idx
                + gap : self.history_length
                + self.horizon
                + idx
                + gap
            ].values,
        )
        return x, y

    def _validate_dataframe(self) -> None:
        """Validate the input dataframe.

        - We require the dataframe index to be a DatetimeIndex.
        - We warn on null values.
        - We warn if the dataframe index is not sorted.
        """
        if not isinstance(
            self.dataframe.index, pd.core.indexes.datetimes.DatetimeIndex
        ):
            raise TypeError(
                "Type of the dataframe index is not DatetimeIndex"
                f": {type(self.dataframe.index)}"
            )
        if self.dataframe.isnull().values.any():
            logger.warning("Dataframe has null")
        if not self.dataframe.index.equals(self.dataframe.index.sort_values()):
            logger.warning("Dataframe index is not sorted")

    def __getitem__(self, idx: int) -> Tuple[np.ndarray, np.ndarray]:
        if isinstance(idx, slice):
            start = idx.start if idx.start is not None else 0
            stop = idx.stop if idx.stop is not None else self.length
            if (start < 0) or (stop > self.length):
                raise IndexError(f"Slice out of range: {idx}")
            step = idx.step if idx.step is not None else 1
            return [self.moving_slicing(i, self.gap) for i in range(start, stop, step)]
        else:
            if idx >= self.length:
                raise IndexError("End of dataset")
            return self.moving_slicing(idx, self.gap)

    def __len__(self) -> int:
        return self.length
For example, given a time series dataset,
| index | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| 10 | 10 |
| 11 | 11 |
| 12 | 12 |
| 13 | 13 |
| 14 | 14 |
The first data point of DataFrameDataset(dataframe=df, history_length=10, horizon=1) will be
(array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]]),
array([[10]]))
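The same slicing can be sketched with plain numpy to verify the shapes and the number of samples (this mirrors the behavior of DataFrameDataset but is not the class itself):

```python
import numpy as np

# Sliding-window slicing with history_length=10, horizon=1
# on the toy series 0..14 from the table above.
y = np.arange(15).reshape(-1, 1)
history_length, horizon = 10, 1


def slide(idx: int):
    x = y[idx : idx + history_length]
    target = y[idx + history_length : idx + history_length + horizon]
    return x, target


x0, t0 = slide(0)
print(x0.ravel().tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(t0.ravel().tolist())  # [10]
n_samples = len(y) - history_length - horizon + 1
print(n_samples)  # 5
```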
Pendulum Dataset¶
We create a synthetic dataset based on the physical model called pendulum. The pendulum is modeled as a damped harmonic oscillator, i.e.,
where \(\theta(t)\) is the angle of the pendulum at time \(t\). The period \(p\) is calculated using
with \(L\) being the length of the pendulum and \(g\) being the surface gravity.

import math
from functools import cached_property
from typing import Dict, List

import pandas as pd


class Pendulum:
    r"""Class for generating time series data for a pendulum.

    The pendulum is modelled as a damped harmonic oscillator, i.e.,

    $$
    \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t),
    $$

    where $\theta(t)$ is the angle of the pendulum at time $t$.
    The period $p$ is calculated using

    $$
    p = 2 \pi \sqrt{L / g},
    $$

    with $L$ being the length of the pendulum
    and $g$ being the surface gravity.

    :param length: Length of the pendulum.
    :param gravity: Acceleration due to gravity.
    """

    def __init__(self, length: float, gravity: float = 9.81) -> None:
        self.length = length
        self.gravity = gravity

    @cached_property
    def period(self) -> float:
        """Calculate the period of the pendulum."""
        return 2 * math.pi * math.sqrt(self.length / self.gravity)

    def __call__(
        self,
        num_periods: int,
        num_samples_per_period: int,
        initial_angle: float = 0.1,
        beta: float = 0,
    ) -> Dict[str, List[float]]:
        """Generate time series data for the pendulum.

        Returns a dict with the time steps and the angle
        of the pendulum at each time step.

        :param num_periods: Number of periods to generate.
        :param num_samples_per_period: Number of samples per period.
        :param initial_angle: Initial angle of the pendulum.
        :param beta: Damping coefficient.
        """
        time_step = self.period / num_samples_per_period
        steps = []
        time_series = []
        for i in range(num_periods * num_samples_per_period):
            t = i * time_step
            angle = (
                initial_angle
                * math.cos(2 * math.pi * t / self.period)
                * math.exp(-beta * t)
            )
            steps.append(t)
            time_series.append(angle)
        return {"t": steps, "theta": time_series}
import matplotlib.pyplot as plt

pen = Pendulum(length=100)
df = pd.DataFrame(pen(10, 400, initial_angle=1, beta=0.001))

_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
We take this time series and ask our model to forecast the next step (forecast horizon is 1).
PyTorch Dataset and Lightning DataModule
In our tutorials, we will use PyTorch Lightning extensively.
We defined some useful modules in our ts_dl_utils package
and this notebook.
Forecasting with Feedforward Neural Networks¶
Jupyter Notebook Available
We have a notebook for this section which includes all the code used in this section.
Introduction to Neural Networks
We explain the theories of neural networks in this section. Please read it first if you are not familiar with neural networks.
Feedforward neural networks are simple but powerful models for time series forecasting. In this section, we will build a simple feedforward neural network to forecast our pendulum physics data.
Feedforward Neural Network Model¶
We build a feedforward neural network with 5 hidden layers. The input is passed to the first hidden layer, each hidden layer feeds its output to the next, and the last hidden layer feeds the output layer, which produces the forecasted values.
flowchart TD
input_layer["Input Layer (100)"]
output_layer["Output Layer (1)"]
subgraph hidden_layers["Hidden Layers"]
hidden_layer_1["Hidden Layer (512)"]
hidden_layer_2["Hidden Layer (256)"]
hidden_layer_3["Hidden Layer (64)"]
hidden_layer_4["Hidden Layer (256)"]
hidden_layer_5["Hidden Layer (512)"]
hidden_layer_1 --> hidden_layer_2
hidden_layer_2 --> hidden_layer_3
hidden_layer_3 --> hidden_layer_4
hidden_layer_4 --> hidden_layer_5
end
input_layer --> hidden_layers
hidden_layers --> output_layer
from typing import Dict, List
import dataclasses
from torch.utils.data import Dataset, DataLoader
from torch import nn
import torch
@dataclasses.dataclass
class TSFFNParams:
"""A dataclass to be served as our parameters for the model.
:param hidden_widths: list of dimensions for the hidden layers
"""
hidden_widths: List[int]
class TSFeedForward(nn.Module):
"""Feedforward networks for univaraite time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param ffn_params: the parameters for the FFN network.
"""
def __init__(
self, history_length: int, horizon: int, ffn_params: TSFFNParams
):
super().__init__()
self.ffn_params = ffn_params
self.history_length = history_length
self.horizon = horizon
self.regulate_input = nn.Linear(
self.history_length, self.ffn_params.hidden_widths[0]
)
self.hidden_layers = nn.Sequential(
*[
self._linear_block(dim_in, dim_out)
for dim_in, dim_out in
zip(
self.ffn_params.hidden_widths[:-1],
self.ffn_params.hidden_widths[1:]
)
]
)
self.regulate_output = nn.Linear(
self.ffn_params.hidden_widths[-1], self.horizon
)
@property
def ffn_config(self) -> Dict:
return dataclasses.asdict(self.ffn_params)
def _linear_block(self, dim_in, dim_out):
return nn.Sequential(*[nn.Linear(dim_in, dim_out), nn.ReLU()])
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.regulate_input(x)
x = self.hidden_layers(x)
return self.regulate_output(x)
Results¶
We take 100 time steps as the input history and forecast 1 time step into the future, but with a gap of 10 time steps.

Why the Gap
Since the differences between consecutive steps are tiny, forecasting the immediate next step is quite easy. We add a gap to make the forecasting problem a bit harder.
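The input/target slicing with a gap can be sketched in plain Python (a minimal illustration; `make_windows` is our own helper, not part of the `ts_dl_utils` package):

```python
from typing import List, Tuple

def make_windows(
    series: List[float], history: int, horizon: int, gap: int
) -> List[Tuple[List[float], List[float]]]:
    """Slice a series into (input, target) pairs with a gap in between."""
    pairs = []
    for start in range(len(series) - history - gap - horizon + 1):
        inputs = series[start : start + history]
        target_start = start + history + gap
        targets = series[target_start : target_start + horizon]
        pairs.append((inputs, targets))
    return pairs

series = [float(i) for i in range(20)]
pairs = make_windows(series, history=5, horizon=1, gap=10)
# the first input window is [0..4]; its target skips 10 steps: [15.0]
```

With `gap=0` this reduces to the usual sliding-window setup; the gap only shifts the target start.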
Training
The details for model training can be found in this notebook. We will skip the details but show the loss curve here.

We plotted the forecasts for a test dataset that was held out from training. The forecasts are plotted in red and the ground truth in green. To get a sense of the quality, we also added the naive forecast (repeating the last observed value) in blue.

The feedforward neural network learned the damped sine wave pattern of the pendulum. To quantify the results, we compute a few metrics.
| Metric | FFN | Naive |
|---|---|---|
| Mean Absolute Error | 0.017704 | 0.092666 |
| Mean Squared Error | 0.000571 | 0.010553 |
| Symmetric Mean Absolute Percentage Error | 0.010806 | 0.050442 |
Since the differences between consecutive time steps are small, the naive forecast performs quite well.
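Why the naive baseline is hard to beat on a smooth series can be seen in a few lines of plain Python; the numbers below are made up for illustration, not taken from the notebook:

```python
# a slowly varying (toy) series: each value is close to the previous one
truth = [0.90, 0.85, 0.78, 0.70, 0.60]
# naive forecast: repeat the previous observation
naive = [0.95] + truth[:-1]  # 0.95 is the observation just before this window
# mean absolute error of the naive forecast
mae = sum(abs(t - f) for t, f in zip(truth, naive)) / len(truth)
```

The errors are exactly the step-to-step differences, so the smoother the series, the better the naive forecast looks.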
Multi-horizon Forecasting¶
We perform a similar experiment but forecast 3 time steps into the future. We plot out some samples. In the plot, the orange shaded regions are the predictions. From these samples, we observe that the forecasts make sense.

To observe the quality of the whole time range, we plot out the first forecast step and the corresponding ground truth. The naive forecast plotted in blue has an obvious shift, while the feedforward neural network plotted in red is much closer to the ground truth.

| Metric | FFN | Naive |
|---|---|---|
| Mean Absolute Error | 0.024640 | 0.109485 |
| Mean Squared Error | 0.001116 | 0.014723 |
| Symmetric Mean Absolute Percentage Error | 0.015637 | 0.059591 |
Forecasting with RNN¶
Jupyter Notebook Available
We have a notebook for this section which includes all the code used in this section.
Introduction to Neural Networks
We explain the theories of neural networks in this section. Please read it first if you are not familiar with neural networks.
In section Recurrent Neural Network we discussed the basics of RNN. In this section, we will build an RNN model to forecast our pendulum time series data.
RNN Model¶
We build an RNN model with an input size of 96, a hidden size of 64 and a single RNN block. We use L1 loss in training.
from typing import Dict
import dataclasses
from torch.utils.data import Dataset, DataLoader
from torch import nn
import torch
@dataclasses.dataclass
class TSRNNParams:
"""A dataclass to be served as our parameters for the model.
:param hidden_size: number of dimensions in the hidden state
:param input_size: input dim
:param num_layers: number of units stacked
"""
input_size: int
hidden_size: int
num_layers: int = 1
class TSRNN(nn.Module):
"""RNN for univaraite time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param rnn_params: the parameters for the RNN network.
"""
def __init__(self, history_length: int, horizon: int, rnn_params: TSRNNParams):
super().__init__()
self.rnn_params = rnn_params
self.history_length = history_length
self.horizon = horizon
self.regulate_input = nn.Linear(
self.history_length, self.rnn_params.input_size
)
self.rnn = nn.RNN(
input_size=self.rnn_params.input_size,
hidden_size=self.rnn_params.hidden_size,
num_layers=self.rnn_params.num_layers,
batch_first=True
)
self.regulate_output = nn.Linear(
self.rnn_params.hidden_size, self.horizon
)
@property
def rnn_config(self) -> Dict:
return dataclasses.asdict(self.rnn_params)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.regulate_input(x)
x, _ = self.rnn(x)
return self.regulate_output(x)
One Step Forecasting¶
Similar to Forecasting with Feedforward Neural Networks, we take 100 time steps as the input history and forecast 1 time step into the future, but with a gap of 10 time steps.
Training
The details for model training can be found in this notebook. We will skip the details but show the loss curve here.

With just a few seconds of training, our RNN model can capture the pattern of the pendulum time series data.

The metrics are listed in the following table.
| Metric | RNN | Naive |
|---|---|---|
| Mean Absolute Error | 0.007229 | 0.092666 |
| Mean Squared Error | 0.000074 | 0.010553 |
| Symmetric Mean Absolute Percentage Error | 0.037245 | 0.376550 |
Multi-Horizon Forecasting¶
We also trained the same model to forecast 3 steps into the future, again with a gap of 10 time steps.
Training
The details for model training can be found in this notebook. We will skip the details but show the loss curve here.

Visualizing a few examples of the forecasts, it looks reasonable in many cases.

Similar to the single step forecast, we visualize a specific time step in the forecasts and compare it to the ground truth. Here we choose to visualize the second time step in the forecasts.

The metrics are listed in the following table.
| Metric | RNN | Naive |
|---|---|---|
| Mean Absolute Error | 0.006714 | 0.109485 |
| Mean Squared Error | 0.000069 | 0.014723 |
| Symmetric Mean Absolute Percentage Error | 0.032914 | 0.423563 |
Transformers for Time Series Forecasting¶
Jupyter Notebook Available
We have a notebook for this section which includes all the code used in this section.
Introduction to Transformers
We explain the theories of transformers in this section. Please read it first if you are not familiar with transformers.
The transformer is a good candidate for time series forecasting due to its sequence modeling capability12. In this section, we will introduce some basic ideas of transformer-based models for time series forecasting.
Transformer for Univariate Time Series Forecasting¶
We take a simple univariate time series forecasting task as an example. There are implementations of transformers for multivariate time series forecasting with all sorts of covariates, but we focus on the univariate forecasting problem for simplicity.
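The sequence modeling capability rests on scaled dot-product attention, $\operatorname{softmax}(QK^\top/\sqrt{d})V$. A dependency-free sketch on toy matrices (our own illustration, not the model code used later):

```python
import math

def softmax(row):
    """Numerically stable softmax over a list."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention on plain list-of-list matrices."""
    d = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    out = [
        [sum(w * v[j][col] for j, w in enumerate(row)) for col in range(len(v[0]))]
        for row in weights
    ]
    return out, weights

out, weights = attention(q=[[1.0, 0.0]], k=[[1.0, 0.0], [0.0, 1.0]], v=[[1.0], [2.0]])
# the query aligns with the first key, so the output leans toward v[0]
```

Each row of the attention weights is a probability distribution over the input positions, which is what lets the model mix information from the whole history.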
Dataset¶
In this example, we use the pendulum physics dataset.
Model¶
We built a naive transformer that only has an encoder. The input passes through a linear layer that maps it to the shape the encoder accepts, then through the encoder, and finally through another linear layer that maps the encoder output to the output shape.
flowchart TD
input_linear_layer[Linear Layer for Input]
positional_encoder[Positional Encoder]
encoder[Encoder]
output_linear_layer[Linear Layer for Output]
input_linear_layer --> positional_encoder
positional_encoder --> encoder
encoder --> output_linear_layer
Decoder is Good for Covariates
A decoder in a transformer model is good for capturing future covariates. In our problem, we do not have any covariates at all.
Positional Encoder
In this experiment, we do not include a positional encoder, as it introduces more complexity without helping much in our case3.
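For reference, the standard sinusoidal positional encoding we leave out is $\mathrm{pe}[pos, 2i] = \sin(pos/10000^{2i/d})$ and $\mathrm{pe}[pos, 2i+1] = \cos(pos/10000^{2i/d})$; a minimal sketch of what it computes:

```python
import math

def positional_encoding(num_positions: int, d_model: int):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # paired sin/cos dimensions share the same frequency
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=50, d_model=4)
# position 0 encodes as [sin 0, cos 0, sin 0, cos 0] = [0, 1, 0, 1]
```

In practice this matrix would be added to the input embeddings before the encoder; here we skip that step entirely.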
Evaluations¶
Training
The details for model training can be found in this notebook. We will skip the details but show the loss curve here.

We trained the model using a history length of 50 and plotted the forecasts for a test dataset that was held out from training. The forecasts are plotted in red and the ground truth is plotted in blue.

The forecasts roughly captured the patterns of the pendulum. To quantify the results, we compute a few metrics.
| Metric | Vanilla Transformer | Naive |
|---|---|---|
| Mean Absolute Error | 0.050232 | 0.092666 |
| Mean Squared Error | 0.003625 | 0.010553 |
| Symmetric Mean Absolute Percentage Error | 0.108245 | 0.376550 |
Multi-horizon Forecasting¶
We perform a similar experiment for multi-horizon forecasting (horizon=3). We plot out some samples. In the plot, the orange-shaded regions are the predictions.

To verify that the forecasts make sense, we also plot out a few samples.

The following is a table of the metrics.
| Metric | Vanilla Transformer | Naive |
|---|---|---|
| Mean Absolute Error | 0.057219 | 0.109485 |
| Mean Squared Error | 0.004241 | 0.014723 |
| Symmetric Mean Absolute Percentage Error | 0.112247 | 0.423563 |
Generalization¶
The vanilla transformer has its limitations. For example, it doesn't capture the correlations between series that well. There are many transformer variants designed specifically for time series forecasting456789.
A few forecasting packages implemented transformers for time series forecasting. For example, the neuralforecast package by Nixtla has implemented TFT, Informer, AutoFormer, FEDFormer, and PatchTST, as of November 2023. An alternative is darts. These packages provide documentation and we encourage the reader to check them out for more complicated use cases of transformer-based models.
-
Ahmed S, Nielsen IE, Tripathi A, Siddiqui S, Rasool G, Ramachandran RP. Transformers in time-series analysis: A tutorial. 2022. doi:10.1007/s00034-023-02454-8. ↩
-
Wen Q, Zhou T, Zhang C, Chen W, Ma Z, Yan J et al. Transformers in time series: A survey. 2022. http://arxiv.org/abs/2202.07125. ↩
-
Zhang Y, Jiang Q, Li S, Jin X, Ma X, Yan X. You may not need order in time series forecasting. arXiv [cs.LG]. 2019. http://arxiv.org/abs/1910.09620. ↩
-
Lim B, Arik SO, Loeff N, Pfister T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. 2019. http://arxiv.org/abs/1912.09363. ↩
-
Wu H, Xu J, Wang J, Long M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. 2021. https://github.com/thuml/Autoformer. ↩
-
Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H et al. Informer: Beyond efficient transformer for long sequence time-series forecasting. 2020. http://arxiv.org/abs/2012.07436. ↩
-
Nie Y, Nguyen NH, Sinthong P, Kalagnanam J. A time series is worth 64 words: Long-term forecasting with transformers. 2022. http://arxiv.org/abs/2211.14730. ↩
-
Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. 2022. http://arxiv.org/abs/2201.12740. ↩
-
Liu Y, Hu T, Zhang H, Wu H, Wang S, Ma L et al. iTransformer: Inverted transformers are effective for time series forecasting. 2023. http://arxiv.org/abs/2310.06625. ↩
Forecasting with Convolutional Neural Networks¶
Forecasting with VAE¶
Forecasting with Flow¶
Forecasting with GAN¶
Time Series Forecasting with Neural ODE¶
Jupyter Notebook Available
We have a notebook for this section which includes all the code used in this section.
Introduction to Neural ODE
We explain the theories of neural ODEs in this section. Please read it first if you are not familiar with neural ODEs.
In the section Neural ODE, we have introduced the concept of neural ODE. In this section, we will show how to use neural ODE to do time series forecasting.
A Neural ODE Model¶
We built a single hidden layer neural network as the field,
graph TD
input["Input (100)"]
input_layer["Hidden Layer (100)"]
output_layer["Output Layer (100)"]
hidden_layer["Hidden Layer (256)"]
output["Output (1)"]
input --> input_layer
input_layer --> hidden_layer
hidden_layer --> output_layer
output_layer --> output
The model is built using the package called torchdyn 1.
Packages
Apart from the torchdyn package we use here, there is another package called torchdiffeq 2, developed by the authors of the neural ODE paper.
Single Step Forecasts¶
We trained the model using a history length of 100 and only forecast one step (with a gap of 3 between the input and target). The result is shown below.

Neural ODE is a good forecaster for our pendulum dataset since the pendulum is simply generated by a differential equation. The metrics are also computed and listed below.
| Metric | Neural ODE | Naive |
|---|---|---|
| Mean Absolute Error | 0.003052 | 0.092666 |
| Mean Squared Error | 0.000009 | 0.010553 |
| Symmetric Mean Absolute Percentage Error | 0.021231 | 0.376550 |
Training
The training loss is shown below.

Multi-Step Forecasts¶
We perform a similar experiment but forecast 3 steps.

We plot out some samples and shade the predictions using orange color. The plot below shows that the forecasts are mostly on the right trend.

| Metric | Neural ODE | Naive |
|---|---|---|
| Mean Absolute Error | 0.038421 | 0.109485 |
| Mean Squared Error | 0.001478 | 0.014723 |
| Symmetric Mean Absolute Percentage Error | 0.153392 | 0.423563 |
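It is no surprise that the neural ODE does well here: the series is literally the solution of an ODE. As a side note, the same kind of series can be generated by integrating the pendulum equation directly; below is a semi-implicit Euler sketch for the undamped small-angle case $\ddot\theta = -(2\pi/p)^2\theta$ (our own illustration, not the notebook's data generator):

```python
import math

def integrate_pendulum(theta0: float, period: float, dt: float, steps: int):
    """Semi-implicit (symplectic) Euler for theta'' = -(2*pi/period)**2 * theta."""
    omega_sq = (2 * math.pi / period) ** 2
    theta, velocity = theta0, 0.0
    trajectory = [theta]
    for _ in range(steps):
        velocity -= omega_sq * theta * dt  # update velocity first (symplectic)
        theta += velocity * dt
        trajectory.append(theta)
    return trajectory

period = 2 * math.pi
traj = integrate_pendulum(theta0=0.1, period=period, dt=1e-3, steps=6283)
# after one full period the angle returns close to its initial value
```

The semi-implicit update is chosen because it keeps the oscillation amplitude stable, whereas plain forward Euler would slowly blow up.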
TimeGrad Using Diffusion Model¶
Rasul et al. (2021) proposed a probabilistic forecasting model using denoising diffusion models.
Autoregressive¶
Multivariate Forecasting Problem
Given an input sequence \(\mathbf x_{t-K: t}\), we forecast \(\mathbf x_{t+1:t+H}\).
See this section for more Time Series Forecasting Tasks.
Notation
We use \(x^0\) to denote the actual time series. The superscript \({}^{0}\) will be used to represent the non-diffused values.
To apply the denoising diffusion model in a multivariate forecasting problem, we define our forecasting task as the following autoregressive problem,

At each time step \(t\), we build a denoising diffusion model.
Time Dynamics¶
Note that in the denoising diffusion model, we minimize
The above loss becomes that of the denoising model for a single time step. Explicitly,
Time dynamics can be easily captured by an RNN. To include the time dynamics, we condition on the RNN hidden state \(\mathbf h_{t-1}\) built from the time series up to the previous time step1
Apart from the usual time dimension \(t\), the autoregressive denoising diffusion model has another dimension to optimize: the diffusion step \(n\) for each time \(t\).
The loss for each time step \(t\) is1
That being said, we just need to minimize \(\mathcal L_t\) for each time step \(t\).
Training Algorithm¶
The input data is sliced into fixed-length time series \(\mathbf x_t^0\). Since Eq \eqref{eq:ddpm-loss} shows that the loss can be calculated for an arbitrary \(n\) without depending on any previous diffusion step \(n-1\), training can be done by randomly sampling both \(\mathbf x_t^0\) and \(n\). See Rasul et al. (2021)1.
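The reason arbitrary \(n\) can be sampled is the closed-form forward process of DDPM, \(x^n = \sqrt{\bar\alpha_n}\, x^0 + \sqrt{1-\bar\alpha_n}\,\epsilon\). A scalar sketch with a made-up linear noise schedule (our own illustration, not TimeGrad's actual hyperparameters):

```python
import math
import random

def diffuse(x0: float, alpha_bar_n: float, eps: float) -> float:
    """Closed-form forward diffusion: sample x^n directly from x^0."""
    return math.sqrt(alpha_bar_n) * x0 + math.sqrt(1 - alpha_bar_n) * eps

random.seed(0)
betas = [1e-4 + i * 1e-4 for i in range(100)]  # a toy linear noise schedule
alpha_bar = 1.0
alpha_bars = []
for beta in betas:
    alpha_bar *= 1 - beta  # alpha_bar_n is the cumulative product of (1 - beta)
    alpha_bars.append(alpha_bar)

# an arbitrary diffusion step n can be sampled without touching steps 1..n-1
n = random.randrange(len(alpha_bars))
x_n = diffuse(x0=0.5, alpha_bar_n=alpha_bars[n], eps=random.gauss(0, 1))
```

Because no chain of intermediate steps is needed, each training example can pick a random slice and a random diffusion step independently.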

How to Forecast¶
After training, we obtain the time dynamics encoding \(\mathbf h_T\), with which the denoising steps can be calculated using the reverse process
where \(\mathbf z \sim \mathcal N(\mathbf 0, \mathbf I)\).
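One denoising step of that reverse process can be written out for a scalar; the update follows the standard DDPM form, and all the numbers below are made up for illustration:

```python
import math

def reverse_step(x_n, eps_pred, alpha_n, alpha_bar_n, sigma_n, z):
    """One DDPM reverse step: x^{n-1} from x^n and the predicted noise."""
    mean = (
        x_n - (1 - alpha_n) / math.sqrt(1 - alpha_bar_n) * eps_pred
    ) / math.sqrt(alpha_n)
    return mean + sigma_n * z

# with sigma_n = 0 and z = 0 the step is deterministic
x_prev = reverse_step(
    x_n=0.8, eps_pred=0.1, alpha_n=0.99, alpha_bar_n=0.5, sigma_n=0.0, z=0.0
)
```

In TimeGrad the predicted noise would also be conditioned on the hidden state \(\mathbf h_{T}\); the scalar sketch omits that conditioning.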
For example,
It is Probabilistic¶
The quantiles are calculated by repeating the sampling many times for each forecasted time step1.
Code¶
An implementation of the model can be found in the package pytorch-ts 2.
-
Rasul K, Seward C, Schuster I, Vollgraf R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2101.12072 ↩↩↩↩
-
Rasul K. PyTorchTS. https://github.com/zalandoresearch/pytorch-ts. ↩
Ended: Time Series Forecasting with Deep Learning
Supplementary ↵
Supplementary Materials¶
In this part, we showcase some supplementary materials such as jupyter notebooks.
Notebooks and Utilities for Tutorials¶
All the notebooks are located in the folder dl/notebooks. To run these notebooks, we need to set up our Python environment first. We use poetry to manage our Python environment. For the reasoning behind this choice, please refer to Engineering Tips.
Install Requirements and Create Jupyter Kernel¶
First of all, we need to install all the requirements,
poetry install
or install certain groups using
poetry install --with notebook,visualization,torch,darts
To create a Jupyter kernel for the notebooks, run
poetry run ipython kernel install --user --name=deep-learning
and a Jupyter kernel named deep-learning will be created.
Utilities¶
We have a few utilities that we use in our tutorials. Most of them are located in the package ts_dl_utils located in the folder dl/notebooks/ts_dl_utils.
In principle, the notebooks we provide should work without installing this package. The package is also installed in the environment when you run poetry install, just in case one wants to use the deep-learning kernel created above to run personal notebooks located in other folders.
Notebooks ↵
Box-Cox Transformation¶
from typing import Any, Dict
import matplotlib.pyplot as plt
import pandas as pd
from darts import TimeSeries
from darts.dataprocessing.transformers import BoxCox
from darts.datasets import AirPassengersDataset
from darts.utils import statistics as dus
ap_series = AirPassengersDataset().load()
_, ax = plt.subplots(figsize=(10, 6.18))
ap_series.plot(label="Air Passenger Original Data", ax=ax)
_, ax = plt.subplots(figsize=(10, 6.18))
boxcox_opt = BoxCox()
ap_boxcox_opt_transformed = boxcox_opt.fit_transform(ap_series)
ap_boxcox_opt_transformed.plot(
label=f"$\lambda={boxcox_opt._fitted_params[0].item():0.3f}$", ax=ax
)
_, ax = plt.subplots(figsize=(10, 6.18))
lmbda = 0.01
boxcox = BoxCox(lmbda=lmbda)
boxcox_transformed = boxcox.fit_transform(ap_series)
boxcox_transformed.plot(label=f"Box-Cox Transformed Data (lambda={lmbda})", ax=ax)
_, ax = plt.subplots(figsize=(10, 6.18))
for lmbda in [0.01, 0.1, 0.2]:
boxcox_lmbda = BoxCox(lmbda=lmbda)
boxcox_lmbda_transformed = boxcox_lmbda.fit_transform(ap_series)
boxcox_lmbda_transformed.plot(
label=f"Box-Cox Transformed Data (lambda={lmbda})", ax=ax
)
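Under the hood, the Box-Cox transform is $y = (x^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and $y = \ln x$ at $\lambda = 0$. A minimal sketch (our own, independent of darts) showing the small-$\lambda$ limit approaching the log transform:

```python
import math

def box_cox(x: float, lmbda: float) -> float:
    """Box-Cox transform of a positive value."""
    if lmbda == 0:
        return math.log(x)
    return (x**lmbda - 1) / lmbda

# as lambda -> 0 the transform converges to the natural log
values = [box_cox(2.0, lmbda) for lmbda in (0.5, 0.1, 0.01, 0.0)]
```

This is why the curves for small lambda values in the plots above look so similar to a log-transformed series.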
Plot and Check Variance¶
def var_series(series: TimeSeries, window: int = 36) -> TimeSeries:
series_rolling_var = TimeSeries.from_dataframe(
series.pd_dataframe().rolling(window=window).var().dropna()
)
return series_rolling_var
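To clarify what `var_series` computes, the same rolling variance can be written without darts or pandas (a plain-Python sketch; pandas' `rolling(...).var()` also uses the sample variance, ddof=1):

```python
from statistics import variance
from typing import List

def rolling_variance(series: List[float], window: int) -> List[float]:
    """Sample variance over a sliding window, mirroring DataFrame.rolling(window).var()."""
    return [
        variance(series[i : i + window]) for i in range(len(series) - window + 1)
    ]

series = [1.0, 2.0, 4.0, 7.0, 11.0]
rv = rolling_variance(series, window=3)
```

A growing rolling variance is exactly the symptom the Box-Cox transform is meant to tame.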
def check_stationary(series: TimeSeries) -> Dict[str, Any]:
return {
"is_stationary": dus.stationarity_tests(series),
"kpss": dus.stationarity_test_kpss(series),
"adf": dus.stationarity_test_adf(series),
}
boxcox_grid = []
lmbdas = [0.01, 0.1]
rolling_window = 12
for idx, lmbda in enumerate(lmbdas):
boxcox_lmbda = BoxCox(lmbda=lmbda)
boxcox_lmbda_transformed = boxcox_lmbda.fit_transform(ap_series)
var_lmbda_series = var_series(boxcox_lmbda_transformed, window=rolling_window)
_, ax = plt.subplots(figsize=(10, 6.18))
var_lmbda_series.plot(
label=f"Variance (Rolling Window={rolling_window}) (Box-Cox lambda={lmbda})",
ax=ax,
)
plt.show()
_, ax = plt.subplots(figsize=(10, 6.18))
var_ap_series = var_series(ap_series, window=rolling_window)
var_ap_series.plot(label=f"Variance (Rolling Window={rolling_window})", ax=ax)
A Dataset Generated by Damped Pendulum¶
In this notebook, we demo a dataset we created to simulate the oscillations of a pendulum.
from functools import cached_property
from typing import List, Tuple
import lightning as L
import matplotlib as mpl
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import pandas as pd
from torchmetrics import MetricCollection
from torchmetrics.regression import (
MeanAbsoluteError,
MeanAbsolutePercentageError,
MeanSquaredError,
SymmetricMeanAbsolutePercentageError,
)
from ts_dl_utils.datasets.dataset import DataFrameDataset
from ts_dl_utils.datasets.pendulum import Pendulum, PendulumDataModule
from ts_dl_utils.evaluation.evaluator import Evaluator
from ts_dl_utils.naive_forecasters.last_observation import LastObservationForecaster
Data¶
We create a dataset that models a damped pendulum. The pendulum is modelled as a damped harmonic oscillator, i.e.,
$$ \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t), $$
where $\theta(t)$ is the angle of the pendulum at time $t$. The period $p$ is calculated using
$$ p = 2 \pi \sqrt{L / g}, $$
with $L$ being the length of the pendulum and $g$ being the surface gravity.
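Plugging in a pendulum of length $L = 200$ (in meters, with $g = 9.81\,\mathrm{m/s^2}$) gives a concrete period:

```python
import math

length, gravity = 200.0, 9.81
period = 2 * math.pi * math.sqrt(length / gravity)  # about 28.4 seconds
```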
pen = Pendulum(length=200)
df = pd.DataFrame(
pen(num_periods=5, num_samples_per_period=100, initial_angle=1, beta=0.01)
)
Since the damping constant is very small, the generated data is mostly a sine wave.
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
PyTorch and Lightning DataModule¶
history_length = 100
horizon = 5
ds = DataFrameDataset(dataframe=df, history_length=history_length, horizon=horizon)
print(
f"""
There were {len(df)} rows in the dataframe\n
We got {len(ds)} data points in the dataset (history length: {history_length}, horizon: {horizon})
"""
)
We can create a LightningDataModule for Lightning. When training/evaluating using Lightning, we only need to pass this object pdm to the trainer.
pdm = PendulumDataModule(
history_length=history_length, horizon=horizon, dataframe=df[["theta"]]
)
Naive Forecasts¶
prediction_truths = [i[1].squeeze() for i in pdm.predict_dataloader()]
trainer_naive = L.Trainer(precision="64")
lobs_forecaster = LastObservationForecaster(horizon=horizon)
lobs_predictions = trainer_naive.predict(model=lobs_forecaster, datamodule=pdm)
evaluator = Evaluator(step=0)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator.y_true(dataloader=pdm.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator.y(lobs_predictions), "b-.", label="naive predictions")
plt.legend()
evaluator.metrics(lobs_predictions, pdm.predict_dataloader())
The naive forecaster works well since we do not have dramatic changes between consecutive time steps.
Delayed Embedding¶
ds_de = DataFrameDataset(dataframe=df["theta"][:200], history_length=1, horizon=1)
class DelayedEmbeddingAnimation:
"""Builds an animation for univariate time series
using delayed embedding.
```python
fig, ax = plt.subplots(figsize=(10, 10))
dea = DelayedEmbeddingAnimation(dataset=ds_de, fig=fig, ax=ax)
ani = dea.build(interval=10, save_count=dea.time_steps)
ani.save("results/pendulum_dataset/delayed_embedding_animation.mp4")
```
:param dataset: a PyTorch dataset, input and target should have only length 1
:param fig: figure object from matplotlib
:param ax: axis object from matplotlib
"""
def __init__(
self, dataset: DataFrameDataset, fig: mpl.figure.Figure, ax: mpl.axes.Axes
):
self.dataset = dataset
self.ax = ax
self.fig = fig
@cached_property
def data(self) -> List[Tuple[float, float]]:
return [(i[0][0], i[1][0]) for i in self.dataset]
@cached_property
def x(self):
return [i[0] for i in self.data]
@cached_property
def y(self):
return [i[1] for i in self.data]
def data_gen(self):
for i in self.data:
yield i
def animation_init(self) -> mpl.axes.Axes:
ax.plot(
self.x,
self.y,
)
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])
ax.set_xlabel("t")
ax.set_ylabel("t+1")
return self.ax
def animation_run(self, data: Tuple[float, float]) -> mpl.axes.Axes:
x, y = data
self.ax.scatter(x, y)
return self.ax
@cached_property
def time_steps(self):
return len(self.data)
def build(self, interval: int = 10, save_count: int = 10):
return animation.FuncAnimation(
self.fig,
self.animation_run,
self.data_gen,
interval=interval,
init_func=self.animation_init,
save_count=save_count,
)
fig, ax = plt.subplots(figsize=(10, 10))
dea = DelayedEmbeddingAnimation(dataset=ds_de, fig=fig, ax=ax)
ani = dea.build(interval=10, save_count=dea.time_steps)
gif_writer = animation.PillowWriter(fps=5, metadata=dict(artist="Lei Ma"), bitrate=100)
ani.save("results/pendulum_dataset/delayed_embedding_animation.gif", writer=gif_writer)
# ani.save("results/pendulum_dataset/delayed_embedding_animation.mp4")
Forecast Reconciliation¶
This is a notebook for the section Hierarchical Time Series Reconciliation.
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sympy as sp
from darts import TimeSeries
from darts.utils.model_selection import train_test_split
from darts.utils.statistics import plot_pacf
sns.reset_orig()
plt.rcParams["figure.figsize"] = (10, 6.18)
print(plt.rcParams.get("figure.figsize"))
Some MinT Matrices¶
This section shows a few examples of the MinT method. We use these examples to interpret how MinT works.
m_l = 3
m_w_diag_elements = tuple(sp.Symbol(f"W_{i}") for i in range(1, m_l + 1))
m_s_ident_diag = np.diag([1] * (m_l - 1)).tolist()
m_w_diag_elements, m_s_ident_diag
class MinTMatrices:
def __init__(self, levels: int):
self.levels = levels
@property
def s(self):
s_ident_diag = np.diag([1] * (self.levels - 1)).tolist()
return sp.Matrix(
[
[1] * (self.levels - 1),
]
+ s_ident_diag
)
@property
def w_diag_elements(self):
return tuple(sp.Symbol(f"W_{i}") for i in range(1, self.levels + 1))
@property
def w(self):
return sp.Matrix(np.diag(self.w_diag_elements).tolist())
@property
def p_left(self):
return sp.Inverse(sp.MatMul(sp.Transpose(self.s), sp.Inverse(self.w), self.s))
@property
def p_right(self):
return sp.MatMul(sp.Transpose(self.s), sp.Inverse(self.w))
@property
def p(self):
return sp.MatMul(self.p_left, self.p_right)
@property
def s_p(self):
return sp.MatMul(self.s, self.p)
@property
def s_p_numerical(self):
return sp.lambdify(self.w_diag_elements, self.s_p)
def visualize_s_p(self, w_elements, ax):
sns.heatmap(self.s_p_numerical(*w_elements), annot=True, cbar=False, ax=ax)
ax.grid(False)
ax.set(xticklabels=[], yticklabels=[])
ax.tick_params(bottom=False, left=False)
ax.set_title(f"$W_{{diag}} = {w_elements}$")
return ax
mtm_3 = MinTMatrices(levels=3)
print(
f"s: {sp.latex(mtm_3.s)}\n"
f"p: {sp.latex(mtm_3.p.as_explicit())}\n"
f"s_p: {sp.latex(mtm_3.s_p.as_explicit())}\n"
)
mtm_3.s
mtm_3.p
mtm_3.s_p.as_explicit()
mtm_3.w_diag_elements
mtm_3.s_p_numerical(1, 2, 3)
w_elements = [(1, 1, 1), (2, 1, 1)]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(4 * 2, 4))
for idx, w in enumerate(w_elements):
mtm_3.visualize_s_p(w, axes[idx])
fig.show()
mtm_4 = MinTMatrices(levels=4)
print(
f"s: {sp.latex(mtm_4.s)}\n"
f"p: {sp.latex(mtm_4.p.as_explicit())}\n"
f"s_p: {sp.latex(mtm_4.s_p.as_explicit())}\n"
)
w_elements = [(1, 1, 1, 1), (3, 1, 1, 1)]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(4 * 2, 4))
for idx, w in enumerate(w_elements):
mtm_4.visualize_s_p(w, axes[idx])
fig.show()
mtm_5 = MinTMatrices(levels=5)
print(
f"s: {sp.latex(mtm_5.s)}\n"
f"p: {sp.latex(mtm_5.p.as_explicit())}\n"
f"s_p: {sp.latex(mtm_5.s_p.as_explicit())}\n"
)
w_elements = [(1, 1, 1, 1, 1), (4, 1, 1, 1, 1)]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(4 * 2, 4))
for idx, w in enumerate(w_elements):
mtm_5.visualize_s_p(w, axes[idx])
fig.show()
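The sympy demos above can also be checked numerically. Below is a dependency-free sketch of the same $S P$ computation for the smallest hierarchy (one total, two bottom series) with a diagonal $W$; this is our own simplified example, mirroring the `MinTMatrices` class:

```python
S = [[1, 1], [1, 0], [0, 1]]  # rows: total, bottom_1, bottom_2

def mint_projection(w):
    """Compute S P for MinT with diagonal W = diag(w) on this tiny hierarchy."""
    w_inv = [1.0 / wi for wi in w]
    # a = S^T W^{-1} S (2x2)
    a = [
        [sum(S[k][i] * w_inv[k] * S[k][j] for k in range(3)) for j in range(2)]
        for i in range(2)
    ]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    a_inv = [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]
    # b = S^T W^{-1} (2x3)
    b = [[S[k][i] * w_inv[k] for k in range(3)] for i in range(2)]
    # p = a^{-1} b, then return S p
    p = [
        [sum(a_inv[i][m] * b[m][k] for m in range(2)) for k in range(3)]
        for i in range(2)
    ]
    return [
        [sum(S[i][m] * p[m][k] for m in range(2)) for k in range(3)]
        for i in range(3)
    ]

sp = mint_projection([1.0, 1.0, 1.0])
y_hat = [10.0, 4.0, 5.0]  # incoherent base forecasts: 4 + 5 != 10
y_rec = [sum(sp[i][k] * y_hat[k] for k in range(3)) for i in range(3)]
# after reconciliation the total equals the sum of the bottom series
```

Since $P S = I$, the matrix $S P$ maps any base forecast onto a coherent one and leaves already-coherent forecasts unchanged.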
Load data¶
We load a small sample of the M5 dataset.
df = pd.read_csv(
"https://github.com/datumorphism/dataset-m5-simplified/raw/main/dataset/m5_store_sales.csv",
index_col="date",
)
df["Total"] = df[["CA", "TX", "WI"]].sum(axis="columns")
df.index = pd.to_datetime(df.index)
re_simple_col = re.compile(r"'(\w{2}_\d{1})'")
df.rename(
columns={
c: re_simple_col.findall(c)[0] for c in df.columns if re_simple_col.findall(c)
},
inplace=True,
)
df.head()
value_columns = df.columns.tolist()
value_columns
hierarchy = {
"CA_1": ["CA"],
"CA_2": ["CA"],
"CA_3": ["CA"],
"CA_4": ["CA"],
"TX_1": ["TX"],
"TX_2": ["TX"],
"TX_3": ["TX"],
"WI_1": ["WI"],
"WI_2": ["WI"],
"WI_3": ["WI"],
"CA": ["Total"],
"TX": ["Total"],
"WI": ["Total"],
}
ts = TimeSeries.from_dataframe(
df, value_cols=value_columns, freq="d", hierarchy=hierarchy
)
ts
Visualize and Validate the Data¶
ts_sample = ts.drop_after(ts.time_index[20])
ts_sample[["CA", "CA_1", "CA_2", "CA_3", "CA_4"]].plot()
ts_sample["CA"].plot(label="CA")
(ts_sample["CA_1"] + ts_sample["CA_2"] + ts_sample["CA_3"] + ts_sample["CA_4"]).plot(
label="CA_1 + CA_2 + CA_3 + CA_4", linestyle="--", color="r"
)
Forecasts¶
We split the dataset into two time series, ts_train and ts_test. We will hold out ts_test from training.
ts.time_index
ts_train, ts_test = ts.split_after(ts.time_index[1863])
ts_train["Total"].plot(label="Train")
ts_test["Total"].plot(label="Test")
We check the partial autocorrelation function to choose some parameters for our models.
plot_pacf(ts_train["Total"])
from darts.models import LightGBMModel
model_params = {"lags": 14, "linear_tree": True, "output_chunk_length": 10}
model = LightGBMModel(**model_params)
model.fit(ts_train)
model.save("lightgbm.pkl")
ts_pred = model.predict(n=len(ts_test))
We check the performance visually for CA. The patterns look similar but the scales are a bit off.
ca_columns = ["CA", "CA_1", "CA_2", "CA_3", "CA_4"]
ts_test[ca_columns].plot()
ts_pred[ca_columns].plot(linestyle="--")
vis_columns = ["CA_4"]
ts_test[vis_columns].plot()
ts_pred[vis_columns].plot(linestyle="--")
The forecasts are not coherent.
ts_pred["Total"].plot(label="CA")
(ts_pred["CA"] + ts_pred["TX"] + ts_pred["WI"]).plot(
label="CA + TX + WI", linestyle="--", color="r"
)
ts_pred["CA"].plot(label="CA")
(ts_pred["CA_1"] + ts_pred["CA_2"] + ts_pred["CA_3"] + ts_pred["CA_4"]).plot(
label="CA_1 + CA_2 + CA_3 + CA_4", linestyle="--", color="r"
)
Reconciliation¶
from darts.dataprocessing.transformers import MinTReconciliator
reconciliator = MinTReconciliator(method="wls_val")
reconciliator.fit(ts_train)
ts_pred_recon = reconciliator.transform(ts_pred)
ts_pred_recon["Total"].plot(label="CA")
(ts_pred_recon["CA"] + ts_pred_recon["TX"] + ts_pred_recon["WI"]).plot(
label="CA + TX + WI", linestyle="--", color="r"
)
ts_pred_recon["CA"].plot(label="CA")
(
ts_pred_recon["CA_1"]
+ ts_pred_recon["CA_2"]
+ ts_pred_recon["CA_3"]
+ ts_pred_recon["CA_4"]
).plot(label="CA_1 + CA_2 + CA_3 + CA_4", linestyle="--", color="r")
_, ax = plt.subplots(figsize=(10, 6.18))
ca_columns = ["CA", "CA_1", "CA_2", "CA_3", "CA_4"]
ts_test[ca_columns].plot(ax=ax)
ts_pred_recon[ca_columns].plot(linestyle="--", ax=ax)
What Changed¶
ts_pred_recon_shift = ts_pred_recon - ts_pred
_, ax = plt.subplots(figsize=(10, 6.18))
ts_pred_recon_shift[["Total", "CA", "WI", "TX"]].plot(ax=ax)
_, ax = plt.subplots(figsize=(10, 6.18))
ts_pred_recon_shift[ca_columns + ["Total"]].plot(ax=ax)
To see how the predictions are shifted during reconciliation, we plot out the changes from reconciliation as box plots.
ts_pred_recon_shift[ca_columns + ["Total"]].pd_dataframe().plot.box()
ts_pred_recon_shift["CA"].pd_dataframe().plot.box()
ts_pred_recon_shift[["Total", "CA", "TX", "WI"]].pd_dataframe().plot.box(
title="Box Plot for Reconciled - Original Prediction"
)
ts_pred_recon_shift[["Total", "CA", "TX", "WI"]].pd_dataframe()
max(ts_pred.values().max(), ts_pred_recon.values().max())
chart_component = "Total"
chart_max = max(
ts_pred[chart_component].values().max(),
ts_pred_recon[chart_component].values().max(),
)
chart_min = min(
ts_pred[chart_component].values().min(),
ts_pred_recon[chart_component].values().min(),
)
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(ts_pred[chart_component].values(), ts_pred_recon[chart_component].values())
ax.plot(np.linspace(chart_min, chart_max), np.linspace(chart_min, chart_max))
chart_component = "CA"
chart_max = max(
ts_pred[chart_component].values().max(),
ts_pred_recon[chart_component].values().max(),
)
chart_min = min(
ts_pred[chart_component].values().min(),
ts_pred_recon[chart_component].values().min(),
)
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(ts_pred[chart_component].values(), ts_pred_recon[chart_component].values())
ax.plot(np.linspace(chart_min, chart_max), np.linspace(chart_min, chart_max))
Can Reconciliations Adjust Bias?¶
We create some artificial bias by shifting one of the series down and then perform the reconciliation.
This assumes that the reconciliator has already learned the general patterns on the different levels: we only shift the predictions manually; the training is not touched.
reconciliator_pred_bias = MinTReconciliator(method="wls_val")
df_pred_biased = ts_pred.pd_dataframe().copy()
df_pred_biased["CA_1"] = df_pred_biased["CA_1"] * 0.5
ts_pred_biased = TimeSeries.from_dataframe(df_pred_biased, hierarchy=ts_pred.hierarchy)
ts_pred["CA_1"].plot(label="Original Prediction for CA_1")
ts_pred_biased["CA_1"].plot(label="Manually Shifted Prediction for CA_1")
reconciliator_pred_bias.fit(ts_pred_biased)
ts_pred_biased_recon = reconciliator_pred_bias.transform(ts_pred_biased)
ts_pred["CA_1"].plot(label="Original Prediction for CA_1")
ts_pred_biased["CA_1"].plot(label="Manually Shifted Prediction for CA_1")
ts_pred_biased_recon["CA_1"].plot(label="Reconciled Shifted Prediction for CA_1")
ts_pred_biased_recon["CA"].plot(label="CA")
(
ts_pred_biased_recon["CA_1"]
+ ts_pred_biased_recon["CA_2"]
+ ts_pred_biased_recon["CA_3"]
+ ts_pred_biased_recon["CA_4"]
).plot(label="CA_1 + CA_2 + CA_3 + CA_4", linestyle="--", color="r")
_, ax = plt.subplots(figsize=(10, 6.18))
ca_columns = ["CA", "CA_1", "CA_2", "CA_3", "CA_4"]
ts_test[ca_columns].plot(ax=ax)
ts_pred_biased_recon[ca_columns].plot(linestyle="--", ax=ax)
reconciliator_mint_cov = MinTReconciliator(method="mint_cov")
reconciliator_mint_cov.fit(ts_pred - ts_test)
ts_test[ca_columns].plot()
reconciliator_mint_cov.transform(ts_pred)[ca_columns].plot(linestyle="--")
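Darts' `MinTReconciliator` implements variants of the MinT projection. As a minimal numpy sketch of the underlying formula for the state level of this hierarchy (hypothetical numbers; identity weights, i.e. the OLS special case rather than `wls_val` or `mint_cov`):

```python
import numpy as np

# Hierarchy: Total = CA + TX + WI; component order: [Total, CA, TX, WI].
S = np.array([[1, 1, 1],   # the Total row sums the three bottom-level series
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

def mint_reconcile(y_hat, S, W):
    """Project base forecasts onto the coherent subspace: S (S'W^-1 S)^-1 S'W^-1 y_hat."""
    W_inv = np.linalg.inv(W)
    G = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv
    return S @ G @ y_hat

y_hat = np.array([100.0, 40.0, 35.0, 30.0])    # incoherent: 40 + 35 + 30 = 105 != 100
y_tilde = mint_reconcile(y_hat, S, np.eye(4))  # identity W = the OLS special case
print(y_tilde[0], y_tilde[1:].sum())           # equal after reconciliation
```

The choice of `W` is what distinguishes the methods: OLS uses the identity, while `wls_val` and `mint_cov` weight by (co)variances of the forecast errors.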
Tree Basics¶
import json
from typing import List, Literal, Union
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import tree
sns.set()
Generate Data¶
We generate an artificial dataset about whether to go to the office or work from home.
We will use three features, ["health", "weather", "holiday"]. People go to the office if and only if
- health=1: not sick,
- weather=1: good weather,
- holiday=0: not holiday.
We use 1 to indicate that we go to the office.
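Written as plain code, the labeling rule is a single boolean expression, exactly the kind of function a shallow decision tree can represent:

```python
def go_to_office(health: int, weather: int, holiday: int) -> int:
    """1 iff healthy AND good weather AND not a holiday, i.e., features == [1, 1, 0]."""
    return int(health == 1 and weather == 1 and holiday == 0)

# Only one of the eight feature combinations leads to the office.
print(go_to_office(1, 1, 0))  # 1
print(go_to_office(1, 1, 1))  # 0: holiday wins
```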
class WFHData:
"""
Generate a dataset about whether to go to the office.
Go to the office, if and only if
- I am healthy,
- the weather is good,
- not holiday.
Represented in the feature values, this condition is `[1,1,0]`.
However, we also randomize the target value based on `randomize_prob`:
- `randomize_prob=0`: keep perfect data, no randomization
- `randomize_prob=1`: use the wrong target value. The rules are inverted.
```python
wfh = WFHData(length=10)
```
:param length: the number of data points to generate.
:param randomize_prob: the probability of randomizing the target values.
`0` indicates that we keep the perfect target value based on rules.
:param seed: random generator seed.
"""
def __init__(self, length: int, randomize_prob: float = 0, seed: int = 42):
self.randomize_prob = randomize_prob
self.length = length
self.rng = np.random.default_rng(seed)
self.x = self._generate_feature_values()
self.y = self._generate_target_values()
@property
def feature_names(self) -> List[str]:
return ["health", "weather", "holiday"]
@property
def target_names(self) -> List[str]:
return ["go_to_office"]
@property
def feature_dataframe(self) -> pd.DataFrame:
return pd.DataFrame(self.x, columns=self.feature_names)
@property
def target_dataframe(self) -> pd.DataFrame:
return pd.DataFrame(self.y, columns=self.target_names)
def _generate_feature_values(self) -> List[List[Literal[0, 1]]]:
"""Generate the values for the three features
The values can only be either `0` or `1`.
"""
return self.rng.choice([0, 1], (self.length, len(self.feature_names))).tolist()
def _perfect_target(self) -> List[Literal[0, 1]]:
"""Create target value based on rules:
Go to the office, if and only if
- I am healthy,
- the weather is good,
- not holiday.
Represented in the feature values, this condition is `[1,1,0]`.
"""
target = []
for i in self.x:
if i == [1, 1, 0]:
target.append(1)
else:
target.append(0)
return target
@staticmethod
def _randomize_target(y, rng, probability: float) -> Literal[0, 1]:
"""Randomly choose from the current value `y` and its alternative.
For example, if current value of `y=0`, its alternative is `1`.
We will randomly choose from `0` and `1` based on the specified probability.
If `probability=0`, we return the current value, i.e., `0`.
If `probability=1`, we return the alternative value, i.e., `1`.
Otherwise, it is randomly selected based on the probability.
"""
alternative_y = 1 if y == 0 else 0
return rng.choice(
[y, alternative_y], 1, p=(1 - probability, probability)
).item()
def _generate_target_values(self) -> List[Literal[0, 1]]:
"""Generate the target values"""
y = self._perfect_target()
y = [self._randomize_target(i, self.rng, self.randomize_prob) for i in y]
return y
wfh_demo = WFHData(length=10)
wfh_demo.feature_dataframe
wfh_demo.target_dataframe
Decision Tree on Perfect Data¶
wfh_pure = WFHData(length=100, randomize_prob=0)
clf_pure = tree.DecisionTreeClassifier()
clf_pure.fit(wfh_pure.feature_dataframe, wfh_pure.target_dataframe)
fig, ax = plt.subplots(figsize=(15, 15))
tree.plot_tree(clf_pure, feature_names=wfh_pure.feature_names, ax=ax)
ax.set_title("Tree Trained on Perfect Data")
Impure Data¶
wfh_impure = WFHData(length=100, randomize_prob=0.1)
clf_impure = tree.DecisionTreeClassifier(
max_depth=20, min_samples_leaf=1, min_samples_split=0.0001
)
clf_impure.fit(wfh_impure.feature_dataframe, wfh_impure.target_dataframe)
fig, ax = plt.subplots(figsize=(15, 10))
tree.plot_tree(clf_impure, feature_names=wfh_impure.feature_names, ax=ax)
ax.set_title("Tree Trained on Imperfect Data")
Understand Gini Impurity¶
Gini Impurity for 2 possible classes¶
def gini_2(p1: float, p2: float) -> Union[None, float]:
"""Compute the Gini impurity for the two input values."""
if p1 + p2 <= 1:
return p1 * (1 - p1) + p2 * (1 - p2)
else:
return None
gini_2_test_p1 = np.linspace(0, 1, 1001)
gini_2_test_p2 = np.linspace(0, 1, 1001)
gini_2_test_impurity = [
[gini_2(p1, p2) for p1 in gini_2_test_p1] for p2 in gini_2_test_p2
]
df_gini_2_test = pd.DataFrame(
gini_2_test_impurity,
index=[f"{i:0.2f}" for i in gini_2_test_p2],
columns=[f"{i:0.2f}" for i in gini_2_test_p1],
)
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(df_gini_2_test.loc[::-1,], ax=ax)
ax.set_xlabel("$p_1$")
ax.set_ylabel("$p_2$")
ax.set_title("Gini Impurity for Data with 2 Possible Values")
Gini Impurity for 3 possible classes¶
def gini_3(p1: float, p2: float) -> Union[None, float]:
"""Computes the gini impurity for three potential classes"""
if p1 + p2 <= 1:
return p1 * (1 - p1) + p2 * (1 - p2) + (1 - p1 - p2) * (p1 + p2)
else:
return None
gini_3_test_p1 = np.linspace(0, 1, 1001)
gini_3_test_p2 = np.linspace(0, 1, 1001)
gini_3_test_impurity = [
[gini_3(p1, p2) for p1 in gini_3_test_p1] for p2 in gini_3_test_p2
]
df_gini_3_test = pd.DataFrame(
gini_3_test_impurity,
index=[f"{i:0.2f}" for i in gini_3_test_p2],
columns=[f"{i:0.2f}" for i in gini_3_test_p1],
)
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(df_gini_3_test.loc[::-1,], ax=ax)
ax.set_xlabel("$p_1$")
ax.set_ylabel("$p_2$")
ax.set_title("Gini Impurity for Data with 3 Possible Values")
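Both helpers above are special cases of the general Gini formula for any number of classes, $G = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$. A sketch that takes the full probability vector:

```python
from typing import Sequence

def gini(probabilities: Sequence[float]) -> float:
    """Gini impurity for a full probability vector: sum p_i (1 - p_i) = 1 - sum p_i^2."""
    return 1.0 - sum(p * p for p in probabilities)

print(gini([1.0, 0.0]))        # 0.0: a pure node has no impurity
print(gini([0.5, 0.5]))        # 0.5: maximum impurity for two classes
print(gini([0.2, 0.3, 0.5]))   # matches gini_3(0.2, 0.3) above
```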
Random Forest Playground¶
Outline
- Generate data of specific functions
- Fit the functions using ensemble methods
- Analyze the trees
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.tree as _tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
SelectFromModel,
SelectKBest,
chi2,
mutual_info_regression,
)
from sklearn.model_selection import (
GridSearchCV,
RandomizedSearchCV,
cross_val_score,
learning_curve,
train_test_split,
validation_curve,
)
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
mpl.rcParams["axes.unicode_minus"] = False
from random import random
import numpy as np
import seaborn as sns
from joblib import dump, load
Model¶
Components¶
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(1000, 4000, 11)]
# Number of features to consider at every split
max_features = ["auto", "sqrt"]
# Maximum number of levels in tree
max_depth = [int(x) for x in range(10, 30, 2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [0.001, 0.01, 0.02, 0.05, 0.1, 0.2]
# Minimum number of samples required at each leaf node
min_samples_leaf = [10, 20, 30, 40, 50]
# Method of selecting samples for training each tree
bootstrap = [True, False]
rf_random_grid = {
"rf__n_estimators": n_estimators,
# "rf__max_features": max_features,
"rf__max_depth": max_depth,
"rf__min_samples_split": min_samples_split,
"rf__min_samples_leaf": min_samples_leaf,
"rf__bootstrap": bootstrap,
}
rf = RandomForestRegressor(random_state=42, oob_score=True)
##########
pipeline_steps = [
("rf", rf),
]
pipeline = Pipeline(pipeline_steps)
def pred_true_comparison_plot(dataframe, ax, pred_sample=100):
sns.scatterplot(
dataframe, x="x", y="y", ax=ax, label="y", marker=".", ec="face", s=5
)
sns.scatterplot(
dataframe.sample(pred_sample),
x="x",
y="y_pred",
ax=ax,
label="y_pred",
marker="+",
s=100,
linewidth=2,
)
return ax
def predictions_each_estimators(x, rf_model):
preds = []
for i in x:
i_preds = []
for est in rf_model.best_estimator_["rf"].estimators_:
i_preds.append(est.predict([[i]]).tolist())
i_preds = sum(i_preds, [])
preds.append(i_preds)
return {"x": pd.DataFrame(x, columns=["x"]), "preds": pd.DataFrame(preds)}
Data without Noise¶
X_sin = [6 * random() for i in range(10000)]
y_sin = np.sin(X_sin)
X_sin_test = [6 * random() for i in range(10000)]
y_sin_test = np.sin(X_sin_test)
df_sin = pd.DataFrame(
{
"x": X_sin,
"y": y_sin,
}
)
model = RandomizedSearchCV(
pipeline, cv=10, param_distributions=rf_random_grid, verbose=3, n_jobs=-1
)
model.fit(df_sin[["x"]], df_sin["y"].values.ravel())
sin_score = model.score(df_sin[["x"]], df_sin["y"].values.ravel())
model.best_params_
# dump(model, "reports/rf_sin.joblib")
fig, ax = plt.subplots(figsize=(10 * 10, 4 * 10))
_tree.plot_tree(model.best_estimator_["rf"].estimators_[0], fontsize=7)
Plot out the result
df_sin["y_pred"] = model.predict(df_sin[["x"]])
df_sin
fig, ax = plt.subplots(figsize=(10, 6.18))
pred_true_comparison_plot(df_sin, ax)
ax.set_title(f"Random Forest on Sin Data; $R^2$ Score: {sin_score:0.2f}")
plt.legend()
Plot out the boxplots of each data point
est_sample_skip = 100
sin_est_pred = predictions_each_estimators(sorted(X_sin_test)[::est_sample_skip], model)
df_sin_est_quantiles = pd.merge(
sin_est_pred["x"],
sin_est_pred["preds"].quantile(q=[0.75, 0.25], axis=1).T,
how="left",
left_index=True,
right_index=True,
)
df_sin_est_quantiles["boxsize"] = (
df_sin_est_quantiles[0.75] - df_sin_est_quantiles[0.25]
)
fig, ax = plt.subplots(figsize=(10, 1.5 * 6.18))
fig_skip = 5
ax.violinplot(
sin_est_pred["preds"].values.tolist()[::fig_skip],
positions=sin_est_pred["x"].x.tolist()[::fig_skip],
)
sns.lineplot(df_sin, x="x", y="y", ax=ax, label="y")
plt.xticks([])
# ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(range(10)))
# ax.set_xticklabels([f"{i:0.2f}" for i in sin_est_pred["x"].x])
ax.set_title(
"Violin Plot for All Predictions of Each Tree in a Random Forest on Some Sin Data Points"
)
fig, ax = plt.subplots(figsize=(10, 2 * 6.18))
fig_skip = 5
ax.boxplot(
sin_est_pred["preds"].values.tolist()[::fig_skip],
positions=sin_est_pred["x"].x.tolist()[::fig_skip],
)
sns.lineplot(df_sin, x="x", y="y", ax=ax, label="y")
ax.set_xticklabels([f"{i:0.2f}" for i in sin_est_pred["x"].x[::fig_skip]])
ax.set_title("Box Plot for Tree Predictions on Random Forest on Sin Data")
df_sin_est_quantiles
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.barplot(df_sin_est_quantiles, x="x", y="boxsize")
ax.set_xticklabels([f"{i:0.2f}" for i in sin_est_pred["x"].x])
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.histplot(df_sin_est_quantiles.boxsize, ax=ax)
ax.set_yscale("log")
ax.set_xscale("log")
Data with Noise¶
X_sin_noise = np.array([6 * random() for i in range(10000)])
y_sin_noise = np.array([i + 0.1 * (random() - 0.5) for i in np.sin(X_sin_noise)])
df_sin_noise = pd.DataFrame({"x": X_sin_noise, "y": y_sin_noise})
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.lineplot(df_sin_noise, x="x", y="y", ax=ax, label="y")
model_noise = RandomizedSearchCV(
pipeline, cv=10, param_distributions=rf_random_grid, verbose=3, n_jobs=-1
)
model_noise.fit(df_sin_noise[["x"]], df_sin_noise["y"].values.ravel())
sin_noise_score = model_noise.score(df_sin_noise[["x"]], df_sin_noise["y"])
model_noise.best_params_
# dump(model_noise, "reports/rf_sin_noise.joblib")
fig, ax = plt.subplots(figsize=(9 * 10, 4 * 10))
_tree.plot_tree(model_noise.best_estimator_["rf"].estimators_[0], fontsize=7)
df_sin_noise["y_pred"] = model_noise.predict(df_sin_noise[["x"]])
fig, ax = plt.subplots(figsize=(10, 6.18))
pred_true_comparison_plot(df_sin_noise, ax)
ax.set_title(
f"Random Forest on Sin Data with Noise; Test $R^2$ Score: {sin_noise_score:0.2f}"
)
plt.legend()
sin_noise_est_pred = predictions_each_estimators(
sorted(X_sin_noise)[::100], model_noise
)
df_sin_noise_est_quantiles = pd.merge(
sin_noise_est_pred["x"],
sin_noise_est_pred["preds"].quantile(q=[0.75, 0.25], axis=1).T,
how="left",
left_index=True,
right_index=True,
)
df_sin_noise_est_quantiles["boxsize"] = (
df_sin_noise_est_quantiles[0.75] - df_sin_noise_est_quantiles[0.25]
)
fig, ax = plt.subplots(figsize=(10, 1.5 * 6.18))
fig_skip = 5
ax.violinplot(
sin_noise_est_pred["preds"].values.tolist()[::fig_skip],
positions=sin_noise_est_pred["x"].x.tolist()[::fig_skip],
)
sns.scatterplot(
df_sin_noise, x="x", y="y", ax=ax, label="y", marker=".", ec="face", s=1
)
plt.xticks([])
ax.set_title(
"Violin Plot for All Predictions of Each Tree in a Random Forest on Some Sin Data Points"
)
fig, ax = plt.subplots(figsize=(10, 2 * 6.18))
fig_skip = 5
ax.boxplot(
sin_noise_est_pred["preds"].values.tolist()[::fig_skip],
positions=sin_noise_est_pred["x"].x.tolist()[::fig_skip],
)
sns.scatterplot(
df_sin_noise, x="x", y="y", ax=ax, label="y", marker=".", ec="face", s=1
)
ax.set_xticklabels([f"{i:0.2f}" for i in sin_noise_est_pred["x"].x[::fig_skip]])
ax.set_title("Box Plot for Tree Predictions on Random Forest on Sin Data")
df_sin_noise_est_quantiles["model"] = "with_noise"
df_sin_est_quantiles["model"] = "no_noise"
df_quantiles = pd.concat(
[
df_sin_est_quantiles[["model", "boxsize"]],
df_sin_noise_est_quantiles[["model", "boxsize"]],
]
)
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.boxplot(df_quantiles, x="boxsize", y="model", ax=ax)
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.histplot(
df_sin_noise_est_quantiles.boxsize,
# bins=20,
ax=ax,
kde=True,
label="with Noise",
stat="probability",
binwidth=0.002,
binrange=(0, 0.07),
)
sns.histplot(
df_sin_est_quantiles.boxsize,
# bins=20,
ax=ax,
kde=True,
label="without Noise",
stat="probability",
binwidth=0.002,
binrange=(0, 0.07),
)
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.kdeplot(
df_sin_noise_est_quantiles.boxsize,
# bins=20,
ax=ax,
# hist=False,
label="with Noise",
)
sns.kdeplot(
df_sin_est_quantiles.boxsize,
# bins=20,
ax=ax,
# hist=True,
label="without Noise",
)
plt.legend()
Generalization Error¶
def generalization_error(rhob, s):
res = rhob * (1 - s**2) / s**2
if res > 1:
res = 1
return res
generalization_error(0.2, 0.8)
pe_data = [
[generalization_error(rhob, s) for s in np.linspace(0.01, 0.1, 10)]
for rhob in np.linspace(0.01, 0.1, 10)
]
pe_data_s_label = [s for s in np.linspace(0.01, 0.1, 10)]
pe_data_rhob_label = [rhob for rhob in np.linspace(0.01, 0.1, 10)]
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(pe_data, center=0, ax=ax)
ax.set_xlabel("s")
ax.set_ylabel("correlation")
ax.set_xticklabels([f"{i:0.2f}" for i in pe_data_s_label])
ax.set_yticklabels([f"{i:0.2f}" for i in pe_data_rhob_label])
temp_space = np.linspace(0.1, 1, 91)
pe_data = [[generalization_error(rhob, s) for s in temp_space] for rhob in temp_space]
pe_data_s_label = [s for s in temp_space]
pe_data_rhob_label = [rhob for rhob in temp_space]
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(pe_data, center=0, ax=ax)
ax.set_xlabel("s")
ax.set_ylabel("correlation")
ax.set_xticklabels([f"{i:0.2f}" for i in (ax.get_xticks() + 0.1) / 100])
ax.set_yticklabels([f"{i:0.2f}" for i in (ax.get_yticks() + 0.1) / 100])
for label in ax.xaxis.get_ticklabels()[::2]:
label.set_visible(False)
for label in ax.yaxis.get_ticklabels()[::2]:
label.set_visible(False)
ax.set_title("Upper Limit of Generalization Error of Random Forest")
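As a quick numeric sanity check of Breiman's bound $PE^* \le \bar{\rho}(1 - s^2)/s^2$, here is a cleaner variant of the `generalization_error` helper above:

```python
def generalization_error_bound(rho_bar: float, s: float) -> float:
    """Breiman's upper bound rho_bar * (1 - s^2) / s^2 on random forest error, capped at 1."""
    return min(rho_bar * (1 - s**2) / s**2, 1.0)

# Stronger individual trees (larger s) and lower correlation (smaller rho_bar) tighten the bound.
print(generalization_error_bound(0.2, 0.8))  # ≈ 0.1125
print(generalization_error_bound(0.9, 0.3))  # 1.0: the raw bound exceeds 1, so it is capped
```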
from typing import Callable, Dict, List
import darts.utils as du
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from darts import TimeSeries, metrics
from darts.dataprocessing.transformers import BoxCox
from darts.datasets import AirPassengersDataset
from darts.models import LightGBMModel, NaiveDrift, RandomForest
from sklearn.linear_model import LinearRegression
Following the Darts Official Tutorial¶
Darts provides a tutorial here to help users get started. We replicate parts of it to provide a minimal working example for tree-based models.
darts_air_passenger_series = AirPassengersDataset().load()
darts_air_passenger_series.plot()
darts_air_passenger_series
From the outputs, we see that the time series dataset contains monthly data covering 144 months.
train_series_length = 120
test_series_length = len(darts_air_passenger_series) - train_series_length
train_series_length, test_series_length
(
darts_air_passenger_train,
darts_air_passenger_test,
) = darts_air_passenger_series.split_before(train_series_length)
darts_air_passenger_train.plot(label="Training Data")
darts_air_passenger_test.plot(label="Test Data")
First Random Forest Model¶
ap_horizon = len(darts_air_passenger_test)
ap_rf_params = dict(lags=52, output_chunk_length=ap_horizon)
rf_ap = RandomForest(**ap_rf_params)
rf_ap.fit(darts_air_passenger_train)
To observe how the model performs on the training data, we predict a time range that has already been seen by the model during training.
darts_air_passenger_train.drop_after(
darts_air_passenger_train.time_index[-ap_horizon]
).plot(label="Prediction Input")
darts_air_passenger_train.drop_before(
darts_air_passenger_train.time_index[-ap_horizon]
).plot(label="True Values")
rf_ap.predict(
n=ap_horizon,
series=darts_air_passenger_train.drop_after(
darts_air_passenger_train.time_index[-ap_horizon]
),
).plot(label="Predictions (In-sample)", linestyle="--")
The predictions look amazing. However, tree-based models are known to struggle with out-of-sample extrapolation. In our case, the trend of the time series may cause problems. To test this idea, we plot the predictions for the test date range.
darts_air_passenger_train.plot(label="Train")
darts_air_passenger_test.plot(label="Test")
pred_rf_ap = rf_ap.predict(n=ap_horizon)
pred_rf_ap.plot(label="Prediction", linestyle="--")
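Why trees struggle here can be seen without any library: a regression tree predicts a constant leaf mean, so any input beyond the training range receives the last leaf's value. A hand-rolled depth-1 "stump" illustrates this on a toy trend:

```python
import numpy as np

# A depth-1 regression "stump": one split, each leaf predicts the mean of its targets.
def fit_stump(x, y, threshold):
    left_mean = y[x <= threshold].mean()
    right_mean = y[x > threshold].mean()
    return lambda q: np.where(q <= threshold, left_mean, right_mean)

x = np.arange(10.0)          # training inputs 0..9
y = 2.0 * x                  # a clean upward trend
predict = fit_stump(x, y, threshold=4.5)

print(predict(np.array([2.0, 7.0])))  # [ 4. 14.]: the two leaf means
print(predict(np.array([100.0])))     # [14.]: flat beyond the training range, trend lost
```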
Detrending Helps¶
We train the same model on the detrended dataset and reconstruct the predictions using the trend. This demonstrates that detrended data is easier for a random forest.
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.histplot(
darts_air_passenger_train.pd_dataframe(),
x="#Passengers",
kde=True,
binwidth=50,
binrange=(0, 700),
label="Training Distribution",
stat="probability",
ax=ax,
)
sns.histplot(
darts_air_passenger_test.pd_dataframe(),
x="#Passengers",
kde=True,
binwidth=50,
binrange=(0, 700),
label="Test Distribution",
stat="probability",
color="r",
ax=ax,
)
ax.set_xlabel("# Passengers")
plt.legend()
(
darts_air_passenger_trend,
darts_air_passenger_seasonal,
) = du.statistics.extract_trend_and_seasonality(
darts_air_passenger_series,
# model=du.utils.ModelMode.ADDITIVE,
# method="STL"
)
darts_air_passenger_series.plot()
darts_air_passenger_trend.plot()
(darts_air_passenger_trend * darts_air_passenger_seasonal).plot()
(
darts_air_passenger_seasonal_train,
darts_air_passenger_seasonal_test,
) = darts_air_passenger_seasonal.split_before(120)
darts_air_passenger_seasonal_train.plot(label="Seasonal Component Train")
darts_air_passenger_seasonal_test.plot(label="Seasonal Component Test")
darts_air_passenger_seasonal_test.pd_dataframe()
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.histplot(
darts_air_passenger_seasonal_train.pd_dataframe(),
x="0",
kde=True,
binwidth=0.1,
binrange=(0.7, 1.3),
label="Training Distribution",
stat="probability",
# fill=False,
ax=ax,
)
sns.histplot(
darts_air_passenger_seasonal_test.pd_dataframe(),
x="0",
kde=True,
binwidth=0.1,
binrange=(0.7, 1.3),
label="Test Distribution",
stat="probability",
color="r",
# fill=False,
ax=ax,
)
ax.set_xlabel("Seasonal Component")
plt.legend()
rf_ap_seasonal = RandomForest(**ap_rf_params)
rf_ap_seasonal.fit(darts_air_passenger_seasonal_train)
darts_air_passenger_train.plot(label="Train")
darts_air_passenger_test.plot(label="Test")
pred_rf_ap_seasonal = rf_ap_seasonal.predict(
n=ap_horizon
) * darts_air_passenger_trend.drop_before(119)
pred_rf_ap_seasonal.plot(label="Trend * Predicted Seasonal Component", linestyle="--")
This indicates that trees perform much better out of sample if we only predict the cyclical part of the series. In a real-world case, however, we would have to predict the trend accurately for this to work. To better reconstruct the trend, there are also tricks like Box-Cox transformations.
Train, Test, and Metrics¶
It is not easy to determine the best model by simply looking at the charts. We need some metrics.
air_passenger_boxcox = BoxCox()
darts_air_passenger_train_boxcox = air_passenger_boxcox.fit_transform(
darts_air_passenger_train
)
darts_air_passenger_test_boxcox = air_passenger_boxcox.transform(
darts_air_passenger_test
)
darts_air_passenger_train_boxcox.plot(label="Train (Box-Cox Transformed)")
darts_air_passenger_test_boxcox.plot(label="Test (Box-Cox Transformed)")
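Under the hood, Box-Cox is a one-parameter power transform; darts' `BoxCox` additionally fits the λ parameter to the training data. A hand-rolled sketch of the transform and its inverse, with a fixed, hypothetical λ:

```python
import numpy as np

def boxcox_transform(y: np.ndarray, lmbda: float) -> np.ndarray:
    """Box-Cox power transform: log(y) when lambda = 0, else (y**lambda - 1) / lambda."""
    if lmbda == 0:
        return np.log(y)
    return (y**lmbda - 1.0) / lmbda

def boxcox_inverse(z: np.ndarray, lmbda: float) -> np.ndarray:
    """Invert the transform so predictions can be mapped back to the original scale."""
    if lmbda == 0:
        return np.exp(z)
    return (lmbda * z + 1.0) ** (1.0 / lmbda)

y = np.array([112.0, 118.0, 132.0])            # the first AirPassengers values
z = boxcox_transform(y, lmbda=0.5)
print(np.allclose(boxcox_inverse(z, 0.5), y))  # True: the transform round-trips
```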
def linear_trend_model(series: TimeSeries) -> LinearRegression:
"""Fit a linear trend of the series. This can be used to find the linear
model using training data.
:param series: training timeseries
"""
positional_index_start = 0
series_trend, _ = du.statistics.extract_trend_and_seasonality(series)
model = LinearRegression()
length = len(series_trend)
model.fit(
np.arange(positional_index_start, positional_index_start + length).reshape(
length, 1
),
series_trend.values(),
)
return model
def find_linear_trend(
series: TimeSeries, model, positional_index_start: int = 0
) -> TimeSeries:
"""Using the fitted linear model to find or extrapolate the linear trend.
:param series: train or test timeseries
:param model: LinearRegression model that has `predict` method
:param positional_index_start: the position of the first value in the original timeseries.
"""
length = len(series)
linear_preds = model.predict(
np.arange(positional_index_start, positional_index_start + length).reshape(
length, 1
)
).squeeze()
dataframe = pd.DataFrame(
{"date": series.time_index, "# Passengers": linear_preds}
).set_index("date")
return TimeSeries.from_dataframe(dataframe)
ap_trend_lm = linear_trend_model(darts_air_passenger_train_boxcox)
ap_trend_lm
ap_trend_linear_train = find_linear_trend(
model=ap_trend_lm, series=darts_air_passenger_train_boxcox
)
ap_trend_linear_test = find_linear_trend(
model=ap_trend_lm,
series=darts_air_passenger_test_boxcox,
positional_index_start=train_series_length,
)
darts_air_passenger_train_boxcox.plot(label="Train")
ap_trend_linear_train.plot(label="Linear Trend (Train)")
darts_air_passenger_test_boxcox.plot(label="Test")
ap_trend_linear_test.plot(label="Linear Trend (Test)")
darts_air_passenger_train_transformed = (
darts_air_passenger_train_boxcox - ap_trend_linear_train
)
darts_air_passenger_train_transformed.plot()
rf_bc_lt = RandomForest(**ap_rf_params)
rf_bc_lt.fit(darts_air_passenger_train_transformed)
darts_air_passenger_train.plot()
darts_air_passenger_test.plot()
pred_rf_bc_lt = air_passenger_boxcox.inverse_transform(
rf_bc_lt.predict(n=ap_horizon) + ap_trend_linear_test
)
pred_rf_bc_lt.plot(label="Box-Cox + Linear Detrend Predictions", linestyle="--")
Metrics¶
darts_air_passenger_test.plot(label="Test")
pred_rf_ap.plot(label="Simple RF", linestyle="--")
pred_rf_ap_seasonal.plot(label="RF on Global Detrended Data (Cheating)", linestyle="--")
pred_rf_bc_lt.plot(label="Box-Cox + Linear Detrend", linestyle="--")
benchmark_metrics = [
metrics.mae,
metrics.mape,
metrics.mse,
metrics.rmse,
metrics.smape,
metrics.r2_score,
]
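Two of these metrics have short closed forms worth keeping in mind. Hand-rolled versions as commonly defined (darts' implementations may differ in details such as percentage scaling):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, bounded in [0, 200]."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 200.0 * np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

print(mape([100, 200], [110, 180]))   # (10% + 10%) / 2 = 10.0
print(smape([100, 100], [100, 100]))  # 0.0: a perfect forecast
```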
def benchmark_predictions(
series_true: TimeSeries,
series_prediction: TimeSeries,
metrics: List[Callable],
experiment_id: str,
) -> Dict:
results = []
for m in benchmark_metrics:
results.append(
{
"metric": f"{m.__name__}",
"value": m(series_true, series_prediction),
"experiment": experiment_id,
}
)
return results
benchmark_results = []
for i, pred in zip(
["simple_rf", "detrended_cheating", "boxcox_linear_trend"],
[pred_rf_ap, pred_rf_ap_seasonal, pred_rf_bc_lt],
):
benchmark_results += benchmark_predictions(
series_true=darts_air_passenger_test,
series_prediction=pred,
metrics=benchmark_metrics,
experiment_id=i,
)
df_benchmark_metrics = pd.DataFrame(benchmark_results)
df_benchmark_metrics
metric_chart_grid = sns.FacetGrid(
df_benchmark_metrics,
col="metric",
hue="metric",
col_wrap=2,
height=4,
aspect=1 / 0.618,
sharey=False,
)
metric_chart_grid.map(
sns.barplot, "experiment", "value", order=df_benchmark_metrics.experiment.unique()
)
# for axes in metric_chart_grid.axes.flat:
# _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=90)
# metric_chart_grid.fig.tight_layout(w_pad=1)
from typing import Callable, Dict, List
import darts.utils as du
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from darts import TimeSeries, metrics
from darts.dataprocessing.transformers import BoxCox
from darts.datasets import AirPassengersDataset
from darts.models import LightGBMModel, NaiveDrift
from sklearn.linear_model import LinearRegression
Following the Darts Official Tutorial¶
Darts provides a tutorial here to help users get started. We replicate parts of it to provide a minimal working example for tree-based models.
darts_air_passenger_series = AirPassengersDataset().load()
darts_air_passenger_series.plot()
darts_air_passenger_series
From the outputs, we see that the time series dataset contains monthly data covering 144 months.
train_series_length = 120
test_series_length = len(darts_air_passenger_series) - train_series_length
train_series_length, test_series_length
(
darts_air_passenger_train,
darts_air_passenger_test,
) = darts_air_passenger_series.split_before(train_series_length)
darts_air_passenger_train.plot(label="Training Data")
darts_air_passenger_test.plot(label="Test Data")
First LightGBM Model¶
ap_horizon = len(darts_air_passenger_test)
ap_gbdt_params = dict(lags=52, output_chunk_length=ap_horizon)
gbdt_ap = LightGBMModel(**ap_gbdt_params)
gbdt_ap.fit(darts_air_passenger_train)
In-sample predictions: we plot the predictions for the last 24 months of the training data.
darts_air_passenger_train.drop_after(
darts_air_passenger_train.time_index[-ap_horizon]
).plot(label="Prediction Input")
darts_air_passenger_train.drop_before(
darts_air_passenger_train.time_index[-ap_horizon]
).plot(label="True Values")
gbdt_ap.predict(
n=ap_horizon,
series=darts_air_passenger_train.drop_after(
darts_air_passenger_train.time_index[-ap_horizon]
),
).plot(label="Predictions (In-sample)", linestyle="--")
To observe the out-of-sample performance, we plot the predictions for the test dates.
darts_air_passenger_train.plot(label="Train")
darts_air_passenger_test.plot(label="Test")
pred_gbdt_ap = gbdt_ap.predict(n=ap_horizon)
pred_gbdt_ap.plot(label="Prediction", linestyle="--")
Detrending Helps¶
We train the same model on the detrended dataset and reconstruct the predictions using the trend.
(
darts_air_passenger_trend,
darts_air_passenger_seasonal,
) = du.statistics.extract_trend_and_seasonality(
darts_air_passenger_series,
# model=du.utils.ModelMode.ADDITIVE,
# method="STL"
)
darts_air_passenger_series.plot()
darts_air_passenger_trend.plot()
(darts_air_passenger_trend * darts_air_passenger_seasonal).plot()
(
darts_air_passenger_seasonal_train,
darts_air_passenger_seasonal_test,
) = darts_air_passenger_seasonal.split_before(120)
darts_air_passenger_seasonal_train.plot(label="Seasonal Component Train")
darts_air_passenger_seasonal_test.plot(label="Seasonal Component Test")
fig, ax = plt.subplots(figsize=(10, 6.18))
sns.histplot(
darts_air_passenger_seasonal_train.pd_dataframe(),
x="0",
kde=True,
binwidth=0.1,
binrange=(0.7, 1.3),
label="Training Distribution",
stat="probability",
# fill=False,
ax=ax,
)
sns.histplot(
darts_air_passenger_seasonal_test.pd_dataframe(),
x="0",
kde=True,
binwidth=0.1,
binrange=(0.7, 1.3),
label="Test Distribution",
stat="probability",
color="r",
# fill=False,
ax=ax,
)
ax.set_xlabel("Seasonal Component")
plt.legend()
gbdt_ap_seasonal = LightGBMModel(**ap_gbdt_params)
gbdt_ap_seasonal.fit(darts_air_passenger_seasonal_train)
darts_air_passenger_train.plot(label="Train")
darts_air_passenger_test.plot(label="Test")
pred_rf_ap_seasonal = gbdt_ap_seasonal.predict(
n=ap_horizon
) * darts_air_passenger_trend.drop_before(119)
pred_rf_ap_seasonal.plot(label="Trend * Predicted Seasonal Component", linestyle="--")
This indicates that trees perform much better out of sample if we only predict the cyclical part of the series. In a real-world case, however, we would have to predict the trend accurately for this to work. To better reconstruct the trend, there are also tricks like Box-Cox transformations.
Train, Test, and Metrics¶
It is not easy to determine the best model by simply looking at the charts. We need some metrics.
air_passenger_boxcox = BoxCox()
darts_air_passenger_train_boxcox = air_passenger_boxcox.fit_transform(
darts_air_passenger_train
)
darts_air_passenger_test_boxcox = air_passenger_boxcox.transform(
darts_air_passenger_test
)
darts_air_passenger_train_boxcox.plot(label="Train (Box-Cox Transformed)")
darts_air_passenger_test_boxcox.plot(label="Test (Box-Cox Transformed)")
def linear_trend_model(series: TimeSeries) -> LinearRegression:
"""Fit a linear trend of the series. This can be used to find the linear
model using training data.
:param series: training timeseries
"""
positional_index_start = 0
series_trend, _ = du.statistics.extract_trend_and_seasonality(series)
model = LinearRegression()
length = len(series_trend)
model.fit(
np.arange(positional_index_start, positional_index_start + length).reshape(
length, 1
),
series_trend.values(),
)
return model
def find_linear_trend(
series: TimeSeries, model, positional_index_start: int = 0
) -> TimeSeries:
"""Using the fitted linear model to find or extrapolate the linear trend.
:param series: train or test timeseries
:param model: LinearRegression model that has `predict` method
:param positional_index_start: the position of the first value in the original timeseries.
"""
length = len(series)
linear_preds = model.predict(
np.arange(positional_index_start, positional_index_start + length).reshape(
length, 1
)
).squeeze()
dataframe = pd.DataFrame(
{"date": series.time_index, "# Passengers": linear_preds}
).set_index("date")
return TimeSeries.from_dataframe(dataframe)
ap_trend_lm = linear_trend_model(darts_air_passenger_train_boxcox)
ap_trend_lm
ap_trend_linear_train = find_linear_trend(
model=ap_trend_lm, series=darts_air_passenger_train_boxcox
)
ap_trend_linear_test = find_linear_trend(
model=ap_trend_lm,
series=darts_air_passenger_test_boxcox,
positional_index_start=train_series_length,
)
darts_air_passenger_train_boxcox.plot(label="Train")
ap_trend_linear_train.plot(label="Linear Trend (Train)")
darts_air_passenger_test_boxcox.plot(label="Test")
ap_trend_linear_test.plot(label="Linear Trend (Test)")
darts_air_passenger_train_transformed = (
darts_air_passenger_train_boxcox - ap_trend_linear_train
)
darts_air_passenger_train_transformed.plot()
gbdt_bc_lt = LightGBMModel(**ap_gbdt_params)
gbdt_bc_lt.fit(darts_air_passenger_train_transformed)
darts_air_passenger_train.plot()
darts_air_passenger_test.plot()
pred_gbdt_bc_lt = air_passenger_boxcox.inverse_transform(
gbdt_bc_lt.predict(n=ap_horizon) + ap_trend_linear_test
)
pred_gbdt_bc_lt.plot(label="Box-Cox + Linear Detrend Predictions", linestyle="--")
Linear Tree Horizon¶
Detrending is not the only option. LightGBM also implements a linear-tree variant of the base learners, which fits a linear model in each leaf instead of a constant and thus helps with trending data.
ap_gbdt_linear_tree_params = dict(
lags=52, output_chunk_length=ap_horizon, linear_tree=True
)
gbdt_linear_tree_ap = LightGBMModel(**ap_gbdt_linear_tree_params)
gbdt_linear_tree_ap.fit(darts_air_passenger_train)
darts_air_passenger_train.plot(label="Train")
darts_air_passenger_test.plot(label="Test")
pred_gbdt_linear_tree_ap = gbdt_linear_tree_ap.predict(n=ap_horizon)
pred_gbdt_linear_tree_ap.plot(label="Linear Tree Prediction", linestyle="--")
Metrics¶
darts_air_passenger_test.plot(label="Test")
pred_gbdt_ap.plot(label="Simple GBDT", linestyle="--")
pred_rf_ap_seasonal.plot(
label="GBDT on Global Detrended Data (Cheating)", linestyle="--"
)
pred_gbdt_bc_lt.plot(label="GBDT on Box-Cox + Linear Detrend Data", linestyle="--")
pred_gbdt_linear_tree_ap.plot(label="Linear Tree", linestyle="--", color="r")
benchmark_metrics = [
metrics.mae,
metrics.mape,
metrics.mse,
metrics.rmse,
metrics.smape,
metrics.r2_score,
]
def benchmark_predictions(
series_true: TimeSeries,
series_prediction: TimeSeries,
metrics: List[Callable],
experiment_id: str,
) -> List[Dict]:
results = []
for m in metrics:
results.append(
{
"metric": f"{m.__name__}",
"value": m(series_true, series_prediction),
"experiment": experiment_id,
}
)
return results
benchmark_results = []
for i, pred in zip(
["simple_gbdt", "detrended_cheating", "boxcox_linear_trend", "linear_tree"],
[pred_gbdt_ap, pred_rf_ap_seasonal, pred_gbdt_bc_lt, pred_gbdt_linear_tree_ap],
):
benchmark_results += benchmark_predictions(
series_true=darts_air_passenger_test,
series_prediction=pred,
metrics=benchmark_metrics,
experiment_id=i,
)
df_benchmark_metrics = pd.DataFrame(benchmark_results)
df_benchmark_metrics
metric_chart_grid = sns.FacetGrid(
df_benchmark_metrics,
col="metric",
hue="metric",
col_wrap=2,
height=4,
aspect=1 / 0.618,
sharey=False,
)
metric_chart_grid.map(
sns.barplot, "experiment", "value", order=df_benchmark_metrics.experiment.unique()
)
# for axes in metric_chart_grid.axes.flat:
# _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=90)
# metric_chart_grid.fig.tight_layout(w_pad=1)
Creating Time Series Datasets¶
In this notebook, we explain how to create a time series dataset for PyTorch using the moving slicing technique.
The class DataFrameDataset is also included in our ts_dl_utils package.
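The moving-slicing idea itself can be sketched in a few lines of plain numpy (a toy integer series; the class below adds gaps, validation, and the Dataset interface):

```python
import numpy as np

series = np.arange(15)  # a toy series 0, 1, ..., 14
history_length, horizon, gap = 10, 2, 0

# Slide a window over the series: each sample is (history, future).
samples = [
    (
        series[i : i + history_length],
        series[i + history_length + gap : i + history_length + gap + horizon],
    )
    for i in range(len(series) - history_length - horizon - gap + 1)
]

x0, y0 = samples[0]  # history 0..9, target [10, 11]
```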
from typing import Tuple
import numpy as np
import pandas as pd
from loguru import logger
from torch.utils.data import Dataset
class DataFrameDataset(Dataset):
"""A dataset from a pandas dataframe.
For a given pandas dataframe, this generates a pytorch
compatible dataset by sliding in time dimension.
```python
ds = DataFrameDataset(
dataframe=df, history_length=10, horizon=2
)
```
:param dataframe: input dataframe with a DatetimeIndex.
:param history_length: length of input X in time dimension
in the final Dataset class.
:param horizon: number of steps to be forecasted.
:param gap: gap between input history and prediction
"""
def __init__(
self, dataframe: pd.DataFrame, history_length: int, horizon: int, gap: int = 0
):
super().__init__()
self.dataframe = dataframe
self.history_length = history_length
self.horizon = horizon
self.gap = gap
self.dataframe_rows = len(self.dataframe)
self.length = (
self.dataframe_rows - self.history_length - self.horizon - self.gap + 1
)
def moving_slicing(self, idx: int, gap: int = 0) -> Tuple[np.ndarray, np.ndarray]:
x, y = (
self.dataframe[idx : self.history_length + idx].values,
self.dataframe[
self.history_length
+ idx
+ gap : self.history_length
+ self.horizon
+ idx
+ gap
].values,
)
return x, y
def _validate_dataframe(self) -> None:
"""Validate the input dataframe.
- We require the dataframe index to be a DatetimeIndex.
- This dataset does not tolerate null values.
- The dataframe index should be sorted.
"""
if not isinstance(
self.dataframe.index, pd.core.indexes.datetimes.DatetimeIndex
):
raise TypeError(
"Type of the dataframe index is not DatetimeIndex"
f": {type(self.dataframe.index)}"
)
has_na = self.dataframe.isnull().values.any()
if has_na:
logger.warning("Dataframe has null")
has_index_sorted = self.dataframe.index.equals(
self.dataframe.index.sort_values()
)
if not has_index_sorted:
logger.warning("Dataframe index is not sorted")
def __getitem__(self, idx: int) -> Tuple[np.ndarray, np.ndarray]:
if isinstance(idx, slice):
start = idx.start if idx.start is not None else 0
stop = idx.stop if idx.stop is not None else self.length
if (start < 0) or (stop > self.length):
raise IndexError(f"Slice out of range: {idx}")
step = idx.step if idx.step is not None else 1
return [self.moving_slicing(i, self.gap) for i in range(start, stop, step)]
else:
if idx >= self.length:
raise IndexError("End of dataset")
return self.moving_slicing(idx, self.gap)
def __len__(self) -> int:
return self.length
Examples¶
We create a sample dataframe with a single variable "y".
df = pd.DataFrame(np.arange(15), columns=["y"])
df
history_length=10, horizon=1¶
ds_1 = DataFrameDataset(dataframe=df, history_length=10, horizon=1)
list(ds_1)
history_length=10, horizon=2¶
ds_2 = DataFrameDataset(dataframe=df, history_length=10, horizon=2)
list(ds_2)
history_length=10, horizon=1, gap=1¶
ds_1_gap_1 = DataFrameDataset(dataframe=df, history_length=10, horizon=1, gap=1)
list(ds_1_gap_1)
history_length=10, horizon=1, gap=2¶
ds_1_gap_2 = DataFrameDataset(dataframe=df, history_length=10, horizon=1, gap=2)
list(ds_1_gap_2)
history_length=10, horizon=2, gap=1¶
ds_2_gap_1 = DataFrameDataset(dataframe=df, history_length=10, horizon=2, gap=1)
list(ds_2_gap_1)
history_length=10, horizon=2, gap=2¶
ds_2_gap_2 = DataFrameDataset(dataframe=df, history_length=10, horizon=2, gap=2)
list(ds_2_gap_2)
Feedforward Neural Networks for Univariate Time Series Forecasting¶
In this notebook, we build a feedforward neural network using pytorch to forecast the $\sin$ function as a time series.
import dataclasses
import math
import os
from functools import cached_property
from typing import Dict, List, Tuple
import lightning as L
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from loguru import logger
from torch import nn
from torch.utils.data import DataLoader, Dataset
from ts_dl_utils.datasets.pendulum import Pendulum, PendulumDataModule
from ts_dl_utils.evaluation.evaluator import Evaluator
from ts_dl_utils.naive_forecasters.last_observation import LastObservationForecaster
Data¶
We create a dataset that models a damped pendulum. The pendulum is modelled as a damped harmonic oscillator, i.e.,
$$ \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t), $$where $\theta(t)$ is the angle of the pendulum at time $t$. The period $p$ is calculated using
$$ p = 2 \pi \sqrt{L / g}, $$with $L$ being the length of the pendulum and $g$ being the surface gravity.
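Combining the two formulas, the data generating process can be sketched as follows (the parameter names here are illustrative, not the actual API of the Pendulum class):

```python
import numpy as np


def damped_pendulum_theta(
    t: np.ndarray,
    initial_angle: float = 1.0,
    length: float = 100.0,
    beta: float = 0.001,
    g: float = 9.81,
) -> np.ndarray:
    """theta(t) = theta(0) * cos(2 pi t / p) * exp(-beta t), with p = 2 pi sqrt(L / g)."""
    period = 2 * np.pi * np.sqrt(length / g)
    return initial_angle * np.cos(2 * np.pi * t / period) * np.exp(-beta * t)


theta = damped_pendulum_theta(np.linspace(0, 10, 400))
```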
pen = Pendulum(length=100)
df = pd.DataFrame(pen(10, 400, initial_angle=1, beta=0.001))
df["theta"] = df["theta"] + 2
Since the damping constant is very small, the generated data is mostly a sine wave.
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
Model¶
In this section, we create the FFN model.
@dataclasses.dataclass
class TSFFNParams:
"""A dataclass holding the parameters for the model.
:param hidden_widths: list of widths of the hidden layers
"""
hidden_widths: List[int]
class TSFeedForward(nn.Module):
"""Feedforward networks for univariate time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param ffn_params: the parameters for the FFN network.
"""
def __init__(self, history_length: int, horizon: int, ffn_params: TSFFNParams):
super().__init__()
self.ffn_params = ffn_params
self.history_length = history_length
self.horizon = horizon
self.regulate_input = nn.Linear(
self.history_length, self.ffn_params.hidden_widths[0]
)
self.hidden_layers = nn.Sequential(
*[
self._linear_block(dim_in, dim_out)
for dim_in, dim_out in zip(
self.ffn_params.hidden_widths[:-1],
self.ffn_params.hidden_widths[1:],
)
]
)
self.regulate_output = nn.Linear(
self.ffn_params.hidden_widths[-1], self.horizon
)
@property
def ffn_config(self):
return dataclasses.asdict(self.ffn_params)
def _linear_block(self, dim_in, dim_out):
return nn.Sequential(*[nn.Linear(dim_in, dim_out), nn.ReLU()])
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.regulate_input(x)
x = self.hidden_layers(x)
return self.regulate_output(x)
Forecasting (horizon=1)¶
We use lightning to train our model.
Training Utilities¶
history_length_1_step = 100
horizon_1_step = 1
gap = 10
We will build a few utilities:
- To feed the data into our model, we build a class (DataFrameDataset) that converts the pandas dataframe into a Dataset for pytorch.
- To keep the lightning training code simple, we build a LightningDataModule (PendulumDataModule) and a LightningModule (FFNForecaster).
class FFNForecaster(L.LightningModule):
def __init__(self, ffn: nn.Module):
super().__init__()
self.ffn = ffn
def configure_optimizers(self):
optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.ffn(x)
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"train_loss": loss}, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.ffn(x)
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"val_loss": loss}, prog_bar=True)
return loss
def predict_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.ffn(x)
return x, y_hat
def forward(self, x):
x = x.squeeze().type(self.dtype)
return x, self.ffn(x)
Data, Model and Training¶
DataModule¶
pdm_1_step = PendulumDataModule(
history_length=history_length_1_step,
horizon=horizon_1_step,
gap=gap,
dataframe=df[["theta"]],
)
fig, ax = plt.subplots(figsize=(10, 6.18))
pdm_1_step_sample = list(pdm_1_step.train_dataloader())[0]
pdm_1_step_sample_history = pdm_1_step_sample[0][0, ...].squeeze(-1).numpy()
pdm_1_step_sample_target = pdm_1_step_sample[1][0, ...].squeeze(-1).numpy()
pdm_1_step_sample_steps = np.arange(
0, len(pdm_1_step_sample_history) + gap + len(pdm_1_step_sample_target)
)
ax.plot(
pdm_1_step_sample_steps[: len(pdm_1_step_sample_history)],
pdm_1_step_sample_history,
marker=".",
label="Input",
)
ax.plot(
pdm_1_step_sample_steps[len(pdm_1_step_sample_history) + gap :],
pdm_1_step_sample_target,
"r--",
marker="x",
label="Target",
)
plt.legend()
LightningModule¶
ts_ffn_params_1_step = TSFFNParams(hidden_widths=[512, 256, 64, 256, 512])
ts_ffn_1_step = TSFeedForward(
history_length=history_length_1_step,
horizon=horizon_1_step,
ffn_params=ts_ffn_params_1_step,
)
ts_ffn_1_step
ffn_forecaster_1_step = FFNForecaster(ffn=ts_ffn_1_step)
Trainer¶
logger_1_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="ffn_ts_1_step"
)
trainer_1_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-4, patience=2)
],
logger=logger_1_step,
)
Fitting¶
trainer_1_step.fit(model=ffn_forecaster_1_step, datamodule=pdm_1_step)
Retrieving Predictions¶
predictions_1_step = trainer_1_step.predict(
model=ffn_forecaster_1_step, datamodule=pdm_1_step
)
Naive Forecasts¶
To understand how good our forecasts are, we take the last observations in time and use them as forecasts.
LastObservationForecaster is a forecaster we have built for this purpose.
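The idea is simple enough to sketch directly (a minimal illustration, not the actual LastObservationForecaster implementation):

```python
import numpy as np

# Naive forecast: repeat the last observed value over the whole horizon.
history = np.array([0.1, 0.5, 0.9, 0.7])
horizon = 3

naive_forecast = np.repeat(history[-1], horizon)  # [0.7, 0.7, 0.7]
```

Any model worth training should beat this baseline.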
trainer_naive_1_step = L.Trainer(precision="64")
lobs_forecaster_1_step = LastObservationForecaster(horizon=horizon_1_step)
lobs_1_step_predictions = trainer_naive_1_step.predict(
model=lobs_forecaster_1_step, datamodule=pdm_1_step
)
Evaluations¶
evaluator_1_step = Evaluator(step=0)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
evaluator_1_step.metrics(predictions_1_step, pdm_1_step.predict_dataloader())
evaluator_1_step.metrics(lobs_1_step_predictions, pdm_1_step.predict_dataloader())
Forecasting (horizon=3)¶
Train a Model¶
history_length_m_step = 100
horizon_m_step = 3
pdm_m_step = PendulumDataModule(
history_length=history_length_m_step,
horizon=horizon_m_step,
dataframe=df[["theta"]],
gap=gap,
)
ts_ffn_params_m_step = TSFFNParams(hidden_widths=[512, 256, 64, 256, 512])
ts_ffn_m_step = TSFeedForward(
history_length=history_length_m_step,
horizon=horizon_m_step,
ffn_params=ts_ffn_params_m_step,
)
ts_ffn_m_step
ffn_forecaster_m_step = FFNForecaster(ffn=ts_ffn_m_step)
logger_m_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="ffn_ts_m_step"
)
trainer_m_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-4, patience=2)
],
logger=logger_m_step,
)
trainer_m_step.fit(model=ffn_forecaster_m_step, datamodule=pdm_m_step)
predictions_m_step = trainer_m_step.predict(
model=ffn_forecaster_m_step, datamodule=pdm_m_step
)
Naive Forecaster¶
trainer_naive_m_step = L.Trainer(precision="64")
lobs_forecaster_m_step = LastObservationForecaster(horizon=horizon_m_step)
lobs_m_step_predictions = trainer_naive_m_step.predict(
model=lobs_forecaster_m_step, datamodule=pdm_m_step
)
Evaluation¶
evaluator_m_step = Evaluator(step=2, gap=gap)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_m_step.y_true(dataloader=pdm_m_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_m_step.y(predictions_m_step), "r--", label="predictions")
ax.plot(evaluator_m_step.y(lobs_m_step_predictions), "b-.", label="naive predictions")
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
for i in np.arange(0, 1000, 120):
evaluator_m_step.plot_one_sample(ax=ax, predictions=predictions_m_step, idx=i)
evaluator_m_step.metrics(predictions_m_step, pdm_m_step.predict_dataloader())
evaluator_m_step.metrics(lobs_m_step_predictions, pdm_m_step.predict_dataloader())
RNN for Univariate Time Series Forecasting¶
In this notebook, we build an RNN using pytorch to forecast the $\sin$ function as a time series.
import dataclasses
from functools import cached_property
from typing import Dict, List, Tuple
import lightning as L
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from loguru import logger
from torch import nn
from torch.utils.data import DataLoader, Dataset
from ts_dl_utils.datasets.pendulum import Pendulum, PendulumDataModule
from ts_dl_utils.evaluation.evaluator import Evaluator
from ts_dl_utils.naive_forecasters.last_observation import LastObservationForecaster
Data¶
We create a dataset that models a damped pendulum. The pendulum is modelled as a damped harmonic oscillator, i.e.,
$$ \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t), $$where $\theta(t)$ is the angle of the pendulum at time $t$. The period $p$ is calculated using
$$ p = 2 \pi \sqrt{L / g}, $$with $L$ being the length of the pendulum and $g$ being the surface gravity.
pen = Pendulum(length=100)
df = pd.DataFrame(pen(10, 400, initial_angle=1, beta=0.001))
Since the damping constant is very small, the generated data is mostly a sine wave.
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
Model¶
In this section, we create the RNN model.
@dataclasses.dataclass
class TSRNNParams:
"""A dataclass holding the parameters for the model.
:param input_size: input dimension of the RNN
:param hidden_size: number of dimensions in the hidden state
:param num_layers: number of RNN layers stacked
"""
input_size: int
hidden_size: int
num_layers: int = 1
class TSRNN(nn.Module):
"""RNN for univariate time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param rnn_params: the parameters for the RNN network.
"""
def __init__(self, history_length: int, horizon: int, rnn_params: TSRNNParams):
super().__init__()
self.rnn_params = rnn_params
self.history_length = history_length
self.horizon = horizon
self.regulate_input = nn.Linear(self.history_length, self.rnn_params.input_size)
self.rnn = nn.RNN(
input_size=self.rnn_params.input_size,
hidden_size=self.rnn_params.hidden_size,
num_layers=self.rnn_params.num_layers,
batch_first=True,
)
self.regulate_output = nn.Linear(self.rnn_params.hidden_size, self.horizon)
@property
def rnn_config(self):
return dataclasses.asdict(self.rnn_params)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.regulate_input(x)
x, _ = self.rnn(x)
return self.regulate_output(x)
Training¶
We use lightning to train our model.
Training Utilities¶
history_length_1_step = 100
horizon_1_step = 1
gap = 10
We will build a few utilities:
- To feed the data into our model, we build a class (DataFrameDataset) that converts the pandas dataframe into a Dataset for pytorch.
- To keep the lightning training code simple, we build a LightningDataModule (PendulumDataModule) and a LightningModule (RNNForecaster).
class RNNForecaster(L.LightningModule):
def __init__(self, rnn: nn.Module):
super().__init__()
self.rnn = rnn
def configure_optimizers(self):
optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.rnn(x)
loss = nn.functional.l1_loss(y_hat, y)
self.log_dict({"train_loss": loss}, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.rnn(x)
loss = nn.functional.l1_loss(y_hat, y)
self.log_dict({"val_loss": loss}, prog_bar=True)
return loss
def predict_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze().type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
y_hat = self.rnn(x)
return x, y_hat
def forward(self, x):
x = x.squeeze().type(self.dtype)
return x, self.rnn(x)
Data, Model and Training¶
DataModule¶
pdm_1_step = PendulumDataModule(
history_length=history_length_1_step,
horizon=horizon_1_step,
gap=gap,
dataframe=df[["theta"]],
)
LightningModule¶
ts_rnn_params_1_step = TSRNNParams(input_size=96, hidden_size=64, num_layers=1)
ts_rnn_1_step = TSRNN(
history_length=history_length_1_step,
horizon=horizon_1_step,
rnn_params=ts_rnn_params_1_step,
)
ts_rnn_1_step
rnn_forecaster_1_step = RNNForecaster(rnn=ts_rnn_1_step)
Trainer¶
logger_1_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="rnn_ts_1_step"
)
trainer_1_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-5, patience=2)
],
logger=logger_1_step,
)
Fitting¶
trainer_1_step.fit(model=rnn_forecaster_1_step, datamodule=pdm_1_step)
Retrieving Predictions¶
predictions_1_step = trainer_1_step.predict(
model=rnn_forecaster_1_step, datamodule=pdm_1_step
)
Naive Forecaster¶
trainer_naive_1_step = L.Trainer(precision="64")
lobs_forecaster_1_step = LastObservationForecaster(horizon=horizon_1_step)
lobs_1_step_predictions = trainer_naive_1_step.predict(
model=lobs_forecaster_1_step, datamodule=pdm_1_step
)
Evaluations¶
evaluator_1_step = Evaluator(step=0)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
evaluator_1_step.metrics(predictions_1_step, pdm_1_step.predict_dataloader())
evaluator_1_step.metrics(lobs_1_step_predictions, pdm_1_step.predict_dataloader())
Multi-horizon Forecast (h=3)¶
Train a Model¶
history_length_m_step = 100
horizon_m_step = 3
pdm_m_step = PendulumDataModule(
history_length=history_length_m_step,
horizon=horizon_m_step,
dataframe=df[["theta"]],
gap=gap,
)
ts_rnn_params_m_step = TSRNNParams(input_size=96, hidden_size=64, num_layers=1)
ts_rnn_m_step = TSRNN(
history_length=history_length_m_step,
horizon=horizon_m_step,
rnn_params=ts_rnn_params_m_step,
)
ts_rnn_m_step
rnn_forecaster_m_step = RNNForecaster(rnn=ts_rnn_m_step)
logger_m_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="rnn_ts_m_step"
)
trainer_m_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-5, patience=2)
],
logger=logger_m_step,
)
trainer_m_step.fit(model=rnn_forecaster_m_step, datamodule=pdm_m_step)
predictions_m_step = trainer_m_step.predict(
model=rnn_forecaster_m_step, datamodule=pdm_m_step
)
Naive Forecaster¶
trainer_naive_m_step = L.Trainer(precision="64")
lobs_forecaster_m_step = LastObservationForecaster(horizon=horizon_m_step)
lobs_m_step_predictions = trainer_naive_m_step.predict(
model=lobs_forecaster_m_step, datamodule=pdm_m_step
)
Evaluations¶
evaluator_m_step = Evaluator(step=2, gap=gap)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_m_step.y_true(dataloader=pdm_m_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_m_step.y(predictions_m_step), "r--", label="predictions")
ax.plot(evaluator_m_step.y(lobs_m_step_predictions), "b-.", label="naive predictions")
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
for i in np.arange(0, 1000, 120):
evaluator_m_step.plot_one_sample(ax=ax, predictions=predictions_m_step, idx=i)
evaluator_m_step.metrics(predictions_m_step, pdm_m_step.predict_dataloader())
evaluator_m_step.metrics(lobs_m_step_predictions, pdm_m_step.predict_dataloader())
Transformer for Univariate Time Series Forecasting¶
In this notebook, we build a transformer using pytorch to forecast the $\sin$ function as a time series.
import dataclasses
import math
import lightning as L
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from torch import nn
from ts_dl_utils.datasets.pendulum import Pendulum, PendulumDataModule
from ts_dl_utils.evaluation.evaluator import Evaluator
from ts_dl_utils.naive_forecasters.last_observation import LastObservationForecaster
Data¶
We create a dataset that models a damped pendulum. The pendulum is modelled as a damped harmonic oscillator, i.e.,
$$ \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t), $$where $\theta(t)$ is the angle of the pendulum at time $t$. The period $p$ is calculated using
$$ p = 2 \pi \sqrt{L / g}, $$with $L$ being the length of the pendulum and $g$ being the surface gravity.
pen = Pendulum(length=10000)
df = pd.DataFrame(pen(100, 400, initial_angle=1, beta=0.000001))
Since the damping constant is very small, the generated data is mostly a sine wave.
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
Model¶
In this section, we create the transformer model.
Since we do not deal with future covariates, we do not need a decoder. In this example, we build a simple transformer that only contains attention in the encoder.
@dataclasses.dataclass
class TSTransformerParams:
"""A dataclass that contains all
the parameters for the transformer model.
"""
d_model: int = 512
nhead: int = 8
num_encoder_layers: int = 6
dropout: float = 0.1
class PositionalEncoding(nn.Module):
"""Positional encoding to be added to
input embedding.
:param d_model: hidden dimension of the encoder
:param dropout: rate of dropout
:param max_len: maximum length of the positional
encoding. The encoder cannot encode sequences
longer than max_len.
"""
def __init__(
self,
d_model: int,
dropout: float = 0.1,
max_len: int = 5000,
):
super().__init__()
self.max_len = max_len
self.dropout = nn.Dropout(p=dropout)
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(
torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
)
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.register_buffer("pe", pe)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
:param x: input embedded time series,
shape `[batch_size, seq_len, embedding_dim]`
"""
history_length = x.size(1)
x = x + self.pe[:history_length]
return self.dropout(x)
class TSTransformer(nn.Module):
"""Transformer for univariate time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param transformer_params: all the parameters.
"""
def __init__(
self,
history_length: int,
horizon: int,
transformer_params: TSTransformerParams,
):
super().__init__()
self.transformer_params = transformer_params
self.history_length = history_length
self.horizon = horizon
self.embedding = nn.Linear(1, self.transformer_params.d_model)
self.positional_encoding = PositionalEncoding(
d_model=self.transformer_params.d_model
)
encoder_layer = nn.TransformerEncoderLayer(
d_model=self.transformer_params.d_model,
nhead=self.transformer_params.nhead,
batch_first=True,
)
self.encoder = nn.TransformerEncoder(
encoder_layer, num_layers=self.transformer_params.num_encoder_layers
)
self.reverse_embedding = nn.Linear(self.transformer_params.d_model, 1)
self.decoder = nn.Linear(self.history_length, self.horizon)
@property
def transformer_config(self) -> dict:
"""all the param in dict format"""
return dataclasses.asdict(self.transformer_params)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
:param x: input historical time series,
shape `[batch_size, seq_len, n_var]`
"""
x = self.embedding(x)
x = self.positional_encoding(x)
encoder_state = self.encoder(x)
decoder_in = self.reverse_embedding(encoder_state).squeeze(-1)
return self.decoder(decoder_in)
Training¶
We use lightning to train our model.
Training Utilities¶
history_length_1_step = 100
horizon_1_step = 1
gap = 0
We will build a few utilities:
- To feed the data into our model, we build a class (DataFrameDataset) that converts the pandas dataframe into a Dataset for pytorch.
- To keep the lightning training code simple, we build a LightningDataModule (PendulumDataModule) and a LightningModule (TransformerForecaster).
class TransformerForecaster(L.LightningModule):
"""Transformer forecasting training, validation,
and prediction all collected in one class.
:param transformer: pre-defined transformer model
"""
def __init__(self, transformer: nn.Module):
super().__init__()
self.transformer = transformer
def configure_optimizers(self) -> torch.optim.Optimizer:
optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
return optimizer
def training_step(self, batch: tuple[torch.Tensor], batch_idx: int) -> torch.Tensor:
x, y = batch
y = y.squeeze(-1).type(self.dtype)
y_hat = self.transformer(x)
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"train_loss": loss}, prog_bar=True)
return loss
def validation_step(
self, batch: tuple[torch.Tensor], batch_idx: int
) -> torch.Tensor:
x, y = batch
y = y.squeeze(-1).type(self.dtype)
y_hat = self.transformer(x)
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"val_loss": loss}, prog_bar=True)
return loss
def predict_step(
self, batch: list[torch.Tensor], batch_idx: int
) -> tuple[torch.Tensor]:
x, y = batch
y = y.squeeze(-1).type(self.dtype)
y_hat = self.transformer(x)
return x, y_hat
def forward(self, x: torch.Tensor) -> tuple[torch.Tensor]:
return x, self.transformer(x)
Data, Model and Training¶
DataModule¶
pdm_1_step = PendulumDataModule(
history_length=history_length_1_step,
horizon=horizon_1_step,
dataframe=df[["theta"]],
gap=gap,
)
LightningModule¶
ts_transformer_params_1_step = TSTransformerParams(
d_model=192, nhead=6, num_encoder_layers=1
)
ts_transformer_1_step = TSTransformer(
history_length=history_length_1_step,
horizon=horizon_1_step,
transformer_params=ts_transformer_params_1_step,
)
ts_transformer_1_step
transformer_forecaster_1_step = TransformerForecaster(transformer=ts_transformer_1_step)
transformer_forecaster_1_step
Trainer¶
logger_1_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="transformer_ts_1_step"
)
trainer_1_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-7, patience=3)
],
logger=logger_1_step,
)
Fitting¶
demo_x = list(pdm_1_step.train_dataloader())[0][0].type(
transformer_forecaster_1_step.dtype
)
demo_x.shape
nn.Linear(
1,
ts_transformer_1_step.transformer_params.d_model,
dtype=transformer_forecaster_1_step.dtype,
)(demo_x).shape
ts_transformer_1_step.encoder(ts_transformer_1_step.embedding(demo_x)).shape
trainer_1_step.fit(model=transformer_forecaster_1_step, datamodule=pdm_1_step)
Retrieving Predictions¶
predictions_1_step = trainer_1_step.predict(
model=transformer_forecaster_1_step, datamodule=pdm_1_step
)
Naive Forecaster¶
trainer_naive_1_step = L.Trainer(precision="64")
lobs_forecaster_1_step = LastObservationForecaster(horizon=horizon_1_step)
lobs_1_step_predictions = trainer_naive_1_step.predict(
model=lobs_forecaster_1_step, datamodule=pdm_1_step
)
Evaluations¶
evaluator_1_step = Evaluator(step=0)
fig, ax = plt.subplots(figsize=(50, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
inspection_slice_length = 200
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader())[
:inspection_slice_length
],
"g-",
label="truth",
)
ax.plot(
evaluator_1_step.y(predictions_1_step)[:inspection_slice_length],
"r--",
label="predictions",
)
ax.plot(
evaluator_1_step.y(lobs_1_step_predictions)[:inspection_slice_length],
"b-.",
label="naive predictions",
)
plt.legend()
To quantify the results, we compute a few metrics.
pd.merge(
evaluator_1_step.metrics(predictions_1_step, pdm_1_step.predict_dataloader()),
evaluator_1_step.metrics(lobs_1_step_predictions, pdm_1_step.predict_dataloader()),
how="left",
left_index=True,
right_index=True,
suffixes=["_transformer", "_naive"],
)
Here the transformer achieves a better SMAPE because its forecasts are more accurate for larger values.
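To see why SMAPE rewards accuracy on large values, here is a minimal sketch of the common symmetric MAPE formula (the implementation inside Evaluator may differ in details; this is an illustrative assumption):

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric MAPE: mean of 2|y - yhat| / (|y| + |yhat|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))


# The same absolute error of 0.1 contributes much less when the values are large.
smape([10.0], [10.1])  # small relative penalty
smape([1.0], [1.1])    # roughly ten times larger penalty
```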
Forecasting (horizon=3)¶
Train a Model¶
history_length_m_step = 100
horizon_m_step = 3
pdm_m_step = PendulumDataModule(
history_length=history_length_m_step,
horizon=horizon_m_step,
dataframe=df[["theta"]],
gap=gap,
)
ts_transformer_params_m_step = TSTransformerParams(
d_model=192, nhead=6, num_encoder_layers=1
)
ts_transformer_m_step = TSTransformer(
history_length=history_length_m_step,
horizon=horizon_m_step,
transformer_params=ts_transformer_params_m_step,
)
ts_transformer_m_step
transformer_forecaster_m_step = TransformerForecaster(transformer=ts_transformer_m_step)
logger_m_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="transformer_ts_m_step"
)
trainer_m_step = L.Trainer(
precision="64",
max_epochs=100,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-7, patience=3)
],
logger=logger_m_step,
)
trainer_m_step.fit(model=transformer_forecaster_m_step, datamodule=pdm_m_step)
predictions_m_step = trainer_m_step.predict(
model=transformer_forecaster_m_step, datamodule=pdm_m_step
)
Naive Forecaster¶
trainer_naive_m_step = L.Trainer(precision="64")
lobs_forecaster_m_step = LastObservationForecaster(horizon=horizon_m_step)
lobs_m_step_predictions = trainer_naive_m_step.predict(
model=lobs_forecaster_m_step, datamodule=pdm_m_step
)
Evaluations¶
evaluator_m_step = Evaluator(step=2, gap=gap)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_m_step.y_true(dataloader=pdm_m_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_m_step.y(predictions_m_step), "r--", label="predictions")
ax.plot(evaluator_m_step.y(lobs_m_step_predictions), "b-.", label="naive predictions")
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
for i in np.arange(0, 1000, 120):
evaluator_m_step.plot_one_sample(ax=ax, predictions=predictions_m_step, idx=i)
evaluator_m_step.metrics(predictions_m_step, pdm_m_step.predict_dataloader())
evaluator_m_step.metrics(lobs_m_step_predictions, pdm_m_step.predict_dataloader())
NeuralODE for Univariate Time Series Forecasting¶
In this notebook, we build a NeuralODE using PyTorch to forecast a $\sin$ function as a time series.
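A neural ODE treats the hidden state as the solution of $\mathrm{d}x/\mathrm{d}t = f_\theta(x)$ and integrates it numerically. The torchdyn solver we use below is dopri5, an adaptive Runge-Kutta method; as a toy stand-in, the idea can be sketched with a fixed-step Euler integrator:

```python
import numpy as np


def euler_integrate(f, x0, t_span):
    """Integrate dx/dt = f(x) with explicit Euler steps over t_span."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for t0, t1 in zip(t_span[:-1], t_span[1:]):
        x = x + (t1 - t0) * f(x)
        trajectory.append(x)
    return np.stack(trajectory)


# With f(x) = -x the state decays exponentially toward zero,
# analogous to a damped system.
traj = euler_integrate(lambda x: -x, [1.0], np.linspace(0, 1, 101))
```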
import dataclasses
import math
from functools import cached_property
from typing import Dict, List, Tuple
import lightning as L
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from loguru import logger
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchdyn.core import NeuralODE
from ts_dl_utils.datasets.pendulum import Pendulum, PendulumDataModule
from ts_dl_utils.evaluation.evaluator import Evaluator
from ts_dl_utils.naive_forecasters.last_observation import LastObservationForecaster
Data¶
We create a dataset that models a damped pendulum. The pendulum is modelled as a damped harmonic oscillator, i.e.,
$$ \theta(t) = \theta(0) \cos(2 \pi t / p)\exp(-\beta t), $$where $\theta(t)$ is the angle of the pendulum at time $t$. The period $p$ is calculated using
$$ p = 2 \pi \sqrt{L / g}, $$with $L$ being the length of the pendulum and $g$ being the surface gravity.
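As a quick sanity check of the period formula, assuming standard surface gravity $g = 9.81\,\mathrm{m/s^2}$ and the length 100 used below (the units inside Pendulum may differ; the numbers here are illustrative):

```python
import math

g = 9.81  # assumed surface gravity in m/s^2
L = 100   # pendulum length, matching Pendulum(length=100) below
period = 2 * math.pi * math.sqrt(L / g)
period  # roughly 20.06
```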
pen = Pendulum(length=100)
df = pd.DataFrame(pen(10, 400, initial_angle=1, beta=0.001))
Since the damping constant is very small, the generated data is essentially a sine wave.
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
Model¶
In this section, we create the NeuralODE model.
@dataclasses.dataclass
class TSNODEParams:
"""A dataclass to be served as our parameters for the model.
:param hidden_widths: list of dimensions for the hidden layers
"""
hidden_widths: List[int]
time_span: torch.Tensor
class TSNODE(nn.Module):
"""NeuralODE for univaraite time series modeling.
:param history_length: the length of the input history.
:param horizon: the number of steps to be forecasted.
:param ffn_params: the parameters for the NODE network.
"""
def __init__(self, history_length: int, horizon: int, model_params: TSNODEParams):
super().__init__()
self.model_params = model_params
self.history_length = history_length
self.horizon = horizon
self.time_span = model_params.time_span
self.regulate_input = nn.Linear(
self.history_length, self.model_params.hidden_widths[0]
)
self.hidden_layers = nn.Sequential(
*[
self._linear_block(dim_in, dim_out)
for dim_in, dim_out in zip(
self.model_params.hidden_widths[:-1],
self.model_params.hidden_widths[1:],
)
]
)
self.regulate_output = nn.Linear(
self.model_params.hidden_widths[-1], self.history_length
)
self.network = nn.Sequential(
*[self.regulate_input, self.hidden_layers, self.regulate_output]
)
@property
def node_config(self):
return dataclasses.asdict(self.model_params)
def _linear_block(self, dim_in, dim_out):
return nn.Sequential(*[nn.Linear(dim_in, dim_out), nn.ReLU()])
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.network(x)
Training¶
We use lightning to train our model.
Training Utilities¶
history_length_1_step = 100
horizon_1_step = 1
gap = 10
We will build a few utilities:
- To feed the data into our model, we build a class (DataFrameDataset) that converts the pandas dataframe into a pytorch Dataset.
- To keep the lightning training code simple, we build a LightningDataModule (PendulumDataModule) and a LightningModule (NODEForecaster).
class NODEForecaster(L.LightningModule):
def __init__(self, model: nn.Module):
super().__init__()
self.model = model
self.neural_ode = NeuralODE(
self.model.network,
sensitivity="adjoint",
solver="dopri5",
atol_adjoint=1e-4,
rtol_adjoint=1e-4,
)
self.time_span = self.model.time_span
self.horizon = self.model.horizon
def configure_optimizers(self):
optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze(-1).type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
t_, y_hat = self.neural_ode(x, self.time_span)
y_hat = y_hat[-1, ..., -self.horizon :]
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"train_loss": loss}, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze(-1).type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
t_, y_hat = self.neural_ode(x, self.time_span)
y_hat = y_hat[-1, ..., -self.horizon :]
loss = nn.functional.mse_loss(y_hat, y)
self.log_dict({"val_loss": loss}, prog_bar=True)
return loss
def predict_step(self, batch, batch_idx):
x, y = batch
x = x.squeeze(-1).type(self.dtype)
y = y.squeeze(-1).type(self.dtype)
t_, y_hat = self.neural_ode(x, self.time_span)
y_hat = y_hat[-1, ..., -self.horizon :]
return x, y_hat
def forward(self, x):
x = x.squeeze(-1).type(self.dtype)
t_, y_hat = self.neural_ode(x, self.time_span)
y_hat = y_hat[-1, ..., -self.horizon :]
return x, y_hat
Data, Model and Training¶
DataModule¶
pdm_1_step = PendulumDataModule(
history_length=history_length_1_step,
horizon=horizon_1_step,
dataframe=df[["theta"]],
gap=gap,
)
LightningModule¶
ts_model_params_1_step = TSNODEParams(
hidden_widths=[256], time_span=torch.linspace(0, 1, 101)
)
ts_node_1_step = TSNODE(
history_length=history_length_1_step,
horizon=horizon_1_step,
model_params=ts_model_params_1_step,
)
ts_node_1_step
node_forecaster_1_step = NODEForecaster(model=ts_node_1_step)
Trainer¶
logger_1_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="neuralode_ts_1_step"
)
trainer_1_step = L.Trainer(
precision="32",
max_epochs=10,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-4, patience=2)
],
logger=logger_1_step,
)
Fitting¶
trainer_1_step.fit(model=node_forecaster_1_step, datamodule=pdm_1_step)
Retrieving Predictions¶
predictions_1_step = trainer_1_step.predict(
model=node_forecaster_1_step, datamodule=pdm_1_step
)
Naive Forecasters¶
trainer_naive_1_step = L.Trainer(precision="64")
lobs_forecaster_1_step = LastObservationForecaster(horizon=horizon_1_step)
lobs_1_step_predictions = trainer_naive_1_step.predict(
model=lobs_forecaster_1_step, datamodule=pdm_1_step
)
Evaluations¶
evaluator_1_step = Evaluator(step=0)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
To quantify the results, we compute a few metrics.
evaluator_1_step.metrics(predictions_1_step, pdm_1_step.predict_dataloader())
evaluator_1_step.metrics(lobs_1_step_predictions, pdm_1_step.predict_dataloader())
Forecasting (horizon=3)¶
Train a Model¶
history_length_m_step = 100
horizon_m_step = 3
pdm_m_step = PendulumDataModule(
history_length=history_length_m_step,
horizon=horizon_m_step,
dataframe=df[["theta"]],
gap=gap,
)
ts_model_params_m_step = TSNODEParams(
hidden_widths=[256], time_span=torch.linspace(0, 1, 101)
)
ts_node_m_step = TSNODE(
history_length=history_length_m_step,
horizon=horizon_m_step,
model_params=ts_model_params_m_step,
)
ts_node_m_step
node_forecaster_m_step = NODEForecaster(model=ts_node_m_step)
logger_m_step = L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="neuralode_ts_m_step"
)
trainer_m_step = L.Trainer(
precision="32",
max_epochs=10,
min_epochs=5,
callbacks=[
EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-4, patience=2)
],
logger=logger_m_step,
)
trainer_m_step.fit(model=node_forecaster_m_step, datamodule=pdm_m_step)
predictions_m_step = trainer_m_step.predict(
model=node_forecaster_m_step, datamodule=pdm_m_step
)
Naive Forecaster¶
trainer_naive_m_step = L.Trainer(precision="64")
lobs_forecaster_m_step = LastObservationForecaster(horizon=horizon_m_step)
lobs_m_step_predictions = trainer_naive_m_step.predict(
model=lobs_forecaster_m_step, datamodule=pdm_m_step
)
Evaluations¶
evaluator_m_step = Evaluator(step=2, gap=gap)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_m_step.y_true(dataloader=pdm_m_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_m_step.y(predictions_m_step), "r--", label="predictions")
ax.plot(evaluator_m_step.y(lobs_m_step_predictions), "b-.", label="naive predictions")
plt.legend()
fig, ax = plt.subplots(figsize=(10, 6.18))
for i in np.arange(0, 1000, 120):
evaluator_m_step.plot_one_sample(ax=ax, predictions=predictions_m_step, idx=i)
evaluator_m_step.metrics(predictions_m_step, pdm_m_step.predict_dataloader())
evaluator_m_step.metrics(lobs_m_step_predictions, pdm_m_step.predict_dataloader())
Time Series Data Generation¶
import numpy as np
import pandas as pd
import plotly.express as px
def profile_sin(t: np.ndarray, lambda_min: float, lambda_max: float) -> np.ndarray:
"""generate a sin wave profile for
the expected number of visitors
in every 10min for each hour during a day
:param t: time in minutes
:param lambda_min: minimum number of visitors
:param lambda_max: maximum number of visitors
"""
amplitude = lambda_max - lambda_min
t_rescaled = (t - t.min()) / t.max() * np.pi
return amplitude * np.sin(t_rescaled) + lambda_min
class KioskVisitors:
"""generate number of visitors for a kiosk store
:param daily_profile: expectations of visitors
in each time segment of the day
"""
def __init__(self, daily_profile: np.ndarray):
self.daily_profile = daily_profile
self.daily_segments = len(daily_profile)
def __call__(self, n_days: int) -> pd.DataFrame:
"""generate number of visitors for n_days
:param n_days: number of days to generate visitors
"""
visitors = np.concatenate(
[np.random.poisson(self.daily_profile) for _ in range(n_days)]
)
df = pd.DataFrame(
{
"visitors": visitors,
"time": np.arange(len(visitors)),
"expectation": np.tile(self.daily_profile, n_days),
}
)
return df
Create a sin profile
t = np.arange(0, 12 * 60 / 5, 1)
daily_profile = profile_sin(t, lambda_min=0.5, lambda_max=10)
Generate a time series data representing the number of visitors to a Kiosk.
kiosk_visitors = KioskVisitors(daily_profile=daily_profile)
df_visitors = kiosk_visitors(n_days=10)
px.line(
df_visitors,
x="time",
y=["visitors", "expectation"],
)
TimeVAE¶
We use a VAE to generate time series data. In this example, we train a VAE model on sinusoidal time series data. The overall structure of the model is shown below:
```mermaid
graph TD
    data["Time Series Chunks"] --> E[Encoder]
    E --> L[Latent Space]
    L --> D[Decoder]
    D --> gen["Generated Time Series Chunks"]
```
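The sampling step in the latent space relies on the reparameterization trick, sketched here with NumPy (the model code below applies the same formula with torch tensors inside the encoder):

```python
import numpy as np

rng = np.random.default_rng(42)


def reparameterize(z_mean, z_log_var):
    """Draw z = mu + sigma * eps with sigma = exp(0.5 * log_var),
    keeping the sample differentiable w.r.t. mu and log_var."""
    epsilon = rng.standard_normal(np.shape(z_mean))
    return z_mean + np.exp(0.5 * z_log_var) * epsilon


z = reparameterize(np.zeros((4, 8)), np.zeros((4, 8)))
z.shape  # (4, 8)
```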
import dataclasses
from functools import cached_property
from typing import Dict, List, Tuple
import lightning as L
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from loguru import logger
from torch import nn
from torch.utils.data import DataLoader, Dataset
from ts_dl_utils.datasets.pendulum import Pendulum
Data¶
We will reuse our classic pendulum dataset.
pen = Pendulum(length=100)
df = pd.DataFrame(pen(300, 30, initial_angle=1, beta=0.00001))
df["theta"] = df["theta"] + 2
_, ax = plt.subplots(figsize=(10, 6.18))
df.head(100).plot(x="t", y="theta", ax=ax)
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="theta", ax=ax)
df
def time_delay_embed(df: pd.DataFrame, window_size: int) -> pd.DataFrame:
"""embed time series into a time delay embedding space
Time column `t` is required in the input data frame.
:param df: original time series data frame
:param window_size: window size for the time delay embedding
"""
dfs_embedded = []
for i in df.rolling(window_size):
i_t = i.t.iloc[0]
dfs_embedded.append(
pd.DataFrame(i.reset_index(drop=True))
.drop(columns=["t"])
.T.reset_index(drop=True)
# .rename(columns={"index": "name"})
# .assign(t=i_t)
)
df_embedded = pd.concat(dfs_embedded[window_size - 1 :])
return df_embedded
time_delay_embed(df, 3)
class TimeVAEDataset(Dataset):
"""A dataset from a pandas dataframe.
For a given pandas dataframe, this generates a pytorch
compatible dataset by sliding in time dimension.
```python
ds = TimeVAEDataset(
dataframe=df, window_size=10
)
```
:param dataframe: input dataframe with a DatetimeIndex.
:param window_size: length of time series slicing chunks
"""
def __init__(
self,
dataframe: pd.DataFrame,
window_size: int,
):
super().__init__()
self.dataframe = dataframe
self.window_size = window_size
self.dataframe_rows = len(self.dataframe)
self.length = self.dataframe_rows - self.window_size + 1
def moving_slicing(self, idx: int) -> np.ndarray:
return self.dataframe[idx : self.window_size + idx].values
def _validate_dataframe(self) -> None:
"""Validate the input dataframe.
- We require the dataframe index to be DatetimeIndex.
- The dataframe should not contain null values.
- Dataframe index should be sorted.
"""
if not isinstance(
self.dataframe.index, pd.core.indexes.datetimes.DatetimeIndex
):
raise TypeError(
"Type of the dataframe index is not DatetimeIndex"
f": {type(self.dataframe.index)}"
)
has_na = self.dataframe.isnull().values.any()
if has_na:
logger.warning("Dataframe has null")
has_index_sorted = self.dataframe.index.equals(
self.dataframe.index.sort_values()
)
if not has_index_sorted:
logger.warning("Dataframe index is not sorted")
def __getitem__(self, idx: int) -> Tuple[np.ndarray, np.ndarray]:
if isinstance(idx, slice):
if (idx.start < 0) or (idx.stop >= self.length):
raise IndexError(f"Slice out of range: {idx}")
step = idx.step if idx.step is not None else 1
return [self.moving_slicing(i) for i in range(idx.start, idx.stop, step)]
else:
if idx >= self.length:
raise IndexError("End of dataset")
return self.moving_slicing(idx)
def __len__(self) -> int:
return self.length
class TimeVAEDataModule(L.LightningDataModule):
"""Lightning DataModule for Time Series VAE.
This data module takes a pandas dataframe and generates
the corresponding dataloaders for training, validation and
testing.
```python
time_vae_dm_example = TimeVAEDataModule(
window_size=30, dataframe=df[["theta"]], batch_size=32
)
```
"""
def __init__(
self,
window_size: int,
dataframe: pd.DataFrame,
test_fraction: float = 0.3,
val_fraction: float = 0.1,
batch_size: int = 32,
num_workers: int = 0,
):
super().__init__()
self.window_size = window_size
self.batch_size = batch_size
self.dataframe = dataframe
self.test_fraction = test_fraction
self.val_fraction = val_fraction
self.num_workers = num_workers
self.train_dataset, self.val_dataset = self.split_train_val(
self.train_val_dataset
)
@cached_property
def df_length(self):
return len(self.dataframe)
@cached_property
def df_test_length(self):
return int(self.df_length * self.test_fraction)
@cached_property
def df_train_val_length(self):
return self.df_length - self.df_test_length
@cached_property
def train_val_dataframe(self):
return self.dataframe.iloc[: self.df_train_val_length]
@cached_property
def test_dataframe(self):
return self.dataframe.iloc[self.df_train_val_length :]
@cached_property
def train_val_dataset(self):
return TimeVAEDataset(
dataframe=self.train_val_dataframe,
window_size=self.window_size,
)
@cached_property
def test_dataset(self):
return TimeVAEDataset(
dataframe=self.test_dataframe,
window_size=self.window_size,
)
def split_train_val(self, dataset: Dataset):
return torch.utils.data.random_split(
dataset, [1 - self.val_fraction, self.val_fraction]
)
def train_dataloader(self):
return DataLoader(
dataset=self.train_dataset,
batch_size=self.batch_size,
shuffle=True,
num_workers=self.num_workers,
persistent_workers=(True if self.num_workers > 0 else False),
)
def test_dataloader(self):
return DataLoader(
dataset=self.test_dataset,
batch_size=self.batch_size,
shuffle=False,
num_workers=self.num_workers,
)
def val_dataloader(self):
return DataLoader(
dataset=self.val_dataset,
batch_size=self.batch_size,
shuffle=False,
num_workers=self.num_workers,
persistent_workers=(True if self.num_workers > 0 else False),
)
def predict_dataloader(self):
return DataLoader(
dataset=self.test_dataset, batch_size=len(self.test_dataset), shuffle=False
)
time_vae_dm_example = TimeVAEDataModule(
window_size=30, dataframe=df[["theta"]], batch_size=32
)
len(list(time_vae_dm_example.train_dataloader()))
list(time_vae_dm_example.train_dataloader())[0].shape
Model¶
@dataclasses.dataclass
class VAEParams:
"""Parameters for VAEEncoder and VAEDecoder
:param hidden_layer_sizes: list of hidden layer sizes
:param latent_size: latent space dimension
:param sequence_length: input sequence length
:param n_features: number of features
"""
hidden_layer_sizes: List[int]
latent_size: int
sequence_length: int
n_features: int = 1
@cached_property
def data_size(self) -> int:
"""The dimension of the input data
when flattened.
"""
return self.sequence_length * self.n_features
def asdict(self) -> dict:
return dataclasses.asdict(self)
class VAEMLPEncoder(nn.Module):
"""MLP Encoder of TimeVAE"""
def __init__(self, params: VAEParams):
super().__init__()
self.params = params
encode_layer_sizes = [self.params.data_size] + self.params.hidden_layer_sizes
self.layers_used_to_encode = [
self._linear_block(size_in, size_out)
for size_in, size_out in zip(
encode_layer_sizes[:-1], encode_layer_sizes[1:]
)
]
self.encode = nn.Sequential(*self.layers_used_to_encode)
encoded_size = self.params.hidden_layer_sizes[-1]
self.z_mean_layer = nn.Linear(encoded_size, self.params.latent_size)
self.z_log_var_layer = nn.Linear(encoded_size, self.params.latent_size)
def forward(
self, x: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
batch_size, _, _ = x.size()
x = x.transpose(1, 2)
x = self.encode(x)
z_mean = self.z_mean_layer(x)
z_log_var = self.z_log_var_layer(x)
epsilon = torch.randn(
batch_size, self.params.n_features, self.params.latent_size
).type_as(x)
z = z_mean + torch.exp(0.5 * z_log_var) * epsilon
return z_mean, z_log_var, z
def _linear_block(self, size_in: int, size_out: int) -> nn.Module:
return nn.Sequential(*[nn.Linear(size_in, size_out), nn.ReLU()])
class VAEEncoder(nn.Module):
"""Encoder of TimeVAE
```python
encoder = VAEEncoder(
VAEParams(
hidden_layer_sizes=[40, 30],
latent_size=10,
sequence_length=50
)
)
```
:param params: parameters for the encoder
"""
def __init__(self, params: VAEParams):
super().__init__()
self.params = params
self.hparams = params.asdict()
encode_layer_sizes = [self.params.n_features] + self.params.hidden_layer_sizes
self.layers_used_to_encode = [
self._conv_block(size_in, size_out)
for size_in, size_out in zip(
encode_layer_sizes[:-1], encode_layer_sizes[1:]
)
] + [nn.Flatten()]
self.encode = nn.Sequential(*self.layers_used_to_encode)
encoded_size = self.cal_conv1d_output_dim() * self.params.hidden_layer_sizes[-1]
self.z_mean_layer = nn.Linear(encoded_size, self.params.latent_size)
self.z_log_var_layer = nn.Linear(encoded_size, self.params.latent_size)
def forward(
self, x: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
batch_size, _, _ = x.size()
x = x.transpose(1, 2)
x = self.encode(x)
z_mean = self.z_mean_layer(x).view(
batch_size, self.params.n_features, self.params.latent_size
)
z_log_var = self.z_log_var_layer(x).view(
batch_size, self.params.n_features, self.params.latent_size
)
epsilon = torch.randn(
batch_size, self.params.n_features, self.params.latent_size
).type_as(x)
z = z_mean + torch.exp(0.5 * z_log_var) * epsilon
return z_mean, z_log_var, z
def _linear_block(self, size_in: int, size_out: int) -> nn.Module:
return nn.Sequential(*[nn.Linear(size_in, size_out), nn.ReLU()])
def _conv_block(self, size_in: int, size_out: int) -> nn.Module:
return nn.Sequential(
*[
nn.Conv1d(size_in, size_out, kernel_size=3, stride=2, padding=1),
nn.ReLU(),
]
)
def cal_conv1d_output_dim(self) -> int:
"""the output dimension of all the Conv1d layers"""
output_size = self.params.sequence_length * self.params.n_features
for l in self.layers_used_to_encode:
if l._get_name() == "Conv1d":
output_size = self._conv1d_output_dim(l, output_size)
elif l._get_name() == "Sequential":
for l2 in l:
if l2._get_name() == "Conv1d":
output_size = self._conv1d_output_dim(l2, output_size)
return output_size
def _conv1d_output_dim(self, layer: nn.Module, input_size: int) -> int:
"""Formula to calculate
the output size of Conv1d layer
"""
return (
(input_size + 2 * layer.padding[0] - layer.kernel_size[0])
// layer.stride[0]
) + 1
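To check cal_conv1d_output_dim by hand: each conv block uses kernel_size=3, stride=2, padding=1, so a length-50 input shrinks to 25 and then 13 after two blocks. A standalone sketch of the same formula:

```python
def conv1d_out(n, kernel=3, stride=2, padding=1):
    """Output length of a Conv1d layer with dilation 1."""
    return (n + 2 * padding - kernel) // stride + 1


n = 50  # sequence_length in the VAEEncoder example below
for _ in range(2):  # two conv blocks for hidden_layer_sizes=[40, 30]
    n = conv1d_out(n)
n  # 13, so encoded_size = 13 * 30 = 390
```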
mlp_encoder = VAEMLPEncoder(
VAEParams(hidden_layer_sizes=[40, 30], latent_size=10, sequence_length=50)
)
[i.size() for i in mlp_encoder(torch.ones(32, 50, 1))], mlp_encoder(
torch.ones(32, 50, 1)
)[-1]
encoder = VAEEncoder(
VAEParams(hidden_layer_sizes=[40, 30], latent_size=10, sequence_length=50)
)
[i.size() for i in encoder(torch.ones(32, 50, 1))], encoder(torch.ones(32, 50, 1))[-1]
class VAEDecoder(nn.Module):
"""Decoder of TimeVAE
```python
decoder = VAEDecoder(
VAEParams(
hidden_layer_sizes=[30, 40],
latent_size=10,
sequence_length=50,
)
)
```
:param params: parameters for the decoder
"""
def __init__(self, params: VAEParams):
super().__init__()
self.params = params
self.hparams = params.asdict()
decode_layer_sizes = (
[self.params.latent_size]
+ self.params.hidden_layer_sizes
+ [self.params.data_size]
)
self.decode = nn.Sequential(
*[
self._linear_block(size_in, size_out)
for size_in, size_out in zip(
decode_layer_sizes[:-1], decode_layer_sizes[1:]
)
]
)
def forward(self, z: torch.Tensor) -> torch.Tensor:
output = self.decode(z)
return output.view(-1, self.params.sequence_length, self.params.n_features)
def _linear_block(self, size_in: int, size_out: int) -> nn.Module:
"""create linear block based on the specified sizes"""
return nn.Sequential(*[nn.Linear(size_in, size_out), nn.Softplus()])
decoder = VAEDecoder(
VAEParams(hidden_layer_sizes=[30, 40], latent_size=10, sequence_length=50)
)
decoder(torch.ones(32, 1, 10)).size()
class VAE(nn.Module):
"""VAE model with encoder and decoder
:param encoder: encoder module
:param decoder: decoder module
"""
def __init__(self, encoder: nn.Module, decoder: nn.Module):
super().__init__()
self.encoder = encoder
self.decoder = decoder
self.hparams = {
**{f"encoder_{k}": v for k, v in self.encoder.hparams.items()},
**{f"decoder_{k}": v for k, v in self.decoder.hparams.items()},
}
def forward(
self, x: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
z_mean, z_log_var, z = self.encoder(x)
x_reconstructed = self.decoder(z)
return x_reconstructed, z_mean, z_log_var
class VAEModel(L.LightningModule):
"""VAE model using VAEEncoder, VAEDecoder, and VAE
:param model: VAE model
:param reconstruction_weight: weight for the reconstruction loss
:param learning_rate: learning rate for the optimizer
:param scheduler_max_epochs: maximum epochs for the scheduler
"""
def __init__(
self,
model: VAE,
reconstruction_weight: float = 1.0,
learning_rate: float = 1e-3,
scheduler_max_epochs: int = 10000,
):
super().__init__()
self.model = model
self.reconstruction_weight = reconstruction_weight
self.learning_rate = learning_rate
self.scheduler_max_epochs = scheduler_max_epochs
self.hparams.update(
{
**model.hparams,
**{
"reconstruction_weight": reconstruction_weight,
"learning_rate": learning_rate,
"scheduler_max_epochs": scheduler_max_epochs,
},
}
)
self.save_hyperparameters(self.hparams)
def forward(self, x):
return self.model(x)
def training_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
batch_reconstructed, z_mean, z_log_var = self.model(batch)
loss_total, loss_reconstruction, loss_kl = self.loss(
x=batch,
x_reconstructed=batch_reconstructed,
z_mean=z_mean,
z_log_var=z_log_var,
)
self.log_dict(
{
"train_loss_total": loss_total,
"train_loss_reconstruction": loss_reconstruction,
"train_loss_kl": loss_kl,
}
)
return loss_total
def validation_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
batch_reconstructed, z_mean, z_log_var = self.model(batch)
loss_total, loss_reconstruction, loss_kl = self.loss(
x=batch,
x_reconstructed=batch_reconstructed,
z_mean=z_mean,
z_log_var=z_log_var,
)
self.log_dict(
{
"val_loss_total": loss_total,
"val_loss_reconstruction": loss_reconstruction,
"val_loss_kl": loss_kl,
}
)
return loss_total
def test_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
batch_reconstructed, z_mean, z_log_var = self.model(batch)
loss_total, loss_reconstruction, loss_kl = self.loss(
x=batch,
x_reconstructed=batch_reconstructed,
z_mean=z_mean,
z_log_var=z_log_var,
)
self.log_dict(
{
"test_loss_total": loss_total,
"test_loss_reconstruction": loss_reconstruction,
"test_loss_kl": loss_kl,
}
)
return loss_total
def loss(
self,
x: torch.Tensor,
x_reconstructed: torch.Tensor,
z_log_var: torch.Tensor,
z_mean: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
loss_reconstruction = self.reconstruction_loss(x, x_reconstructed)
loss_kl = -0.5 * torch.sum(1 + z_log_var - z_mean**2 - z_log_var.exp())
loss_total = self.reconstruction_weight * loss_reconstruction + loss_kl
return (
loss_total / x.size(0),
loss_reconstruction / x.size(0),
loss_kl / x.size(0),
)
def reconstruction_loss(
self, x: torch.Tensor, x_reconstructed: torch.Tensor
) -> torch.Tensor:
"""Reconstruction loss for VAE.
$$
\sum_{i=1}^{N} (x_i - x_{reconstructed_i})^2
+ \sum_{i=1}^{N} (\mu_i - \mu_{reconstructed_i})^2
$$
"""
loss = torch.sum((x - x_reconstructed) ** 2) + torch.sum(
(torch.mean(x, dim=1) - torch.mean(x_reconstructed, dim=1)) ** 2
)
return loss
def configure_optimizers(self) -> dict:
optimizer = torch.optim.SGD(self.parameters(), lr=self.learning_rate)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=self.scheduler_max_epochs
)
return {
"optimizer": optimizer,
"lr_scheduler": {
"scheduler": scheduler,
"monitor": "train_loss",
"interval": "step",
"frequency": 1,
},
}
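The loss_kl term in the loss above is the closed-form KL divergence between the approximate posterior $\mathcal{N}(\mu, \sigma^2)$ and the standard normal prior:
$$ D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2} \sum_i \left(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\right), $$with $\log \sigma^2$ being z_log_var and $\mu$ being z_mean in the code.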
Training¶
window_size = 24
max_epochs = 2000
time_vae_dm = TimeVAEDataModule(
window_size=window_size, dataframe=df[["theta"]], batch_size=32
)
vae = VAE(
encoder=VAEEncoder(
VAEParams(
hidden_layer_sizes=[200, 100, 50],
latent_size=8,
sequence_length=window_size,
)
),
decoder=VAEDecoder(
VAEParams(
hidden_layer_sizes=[30, 50, 100], latent_size=8, sequence_length=window_size
)
),
)
vae_model = VAEModel(
vae,
reconstruction_weight=3,
scheduler_max_epochs=max_epochs * len(time_vae_dm.train_dataloader()),
)
trainer = L.Trainer(
precision="64",
max_epochs=max_epochs,
min_epochs=5,
callbacks=[
EarlyStopping(
monitor="val_loss_total", mode="min", min_delta=1e-10, patience=10
)
],
logger=L.pytorch.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="time_vae_naive"
),
)
trainer.fit(model=vae_model, datamodule=time_vae_dm)
Fitted Model¶
IS_RELOAD = True
if IS_RELOAD:
checkpoint_path = "lightning_logs/time_vae_naive/version_29/checkpoints/epoch=1999-step=354000.ckpt"
vae_model_reloaded = VAEModel.load_from_checkpoint(checkpoint_path, model=vae)
else:
vae_model_reloaded = vae_model
for pred_batch in time_vae_dm.predict_dataloader():
print(pred_batch.size())
i_pred = vae_model_reloaded.model(pred_batch.float().cuda())
break
i_pred[0].size()
import matplotlib.pyplot as plt
_, ax = plt.subplots()
element = 4
ax.plot(pred_batch.detach().numpy()[element, :, 0])
ax.plot(i_pred[0].cpu().detach().numpy()[element, :, 0], "x-")
Data generation using the decoder.
sampling_z = torch.randn(
pred_batch.size(0), vae_model_reloaded.model.encoder.params.latent_size
).type_as(vae_model_reloaded.model.encoder.z_mean_layer.weight)
generated_samples_x = (
vae_model_reloaded.model.decoder(sampling_z).cpu().detach().numpy().squeeze()
)
generated_samples_x.shape
_, ax = plt.subplots()
for i in range(min(len(generated_samples_x), 4)):
ax.plot(generated_samples_x[i, :], "x-")
from openTSNE import TSNE
n_tsne_samples = 100
original_samples = pred_batch.cpu().detach().numpy().squeeze()[:n_tsne_samples]
original_samples.shape
tsne = TSNE(
perplexity=30,
metric="euclidean",
n_jobs=8,
random_state=42,
verbose=True,
)
original_samples_embedding = tsne.fit(original_samples)
generated_samples_x[:n_tsne_samples]
generated_samples_embedding = original_samples_embedding.transform(
generated_samples_x[:n_tsne_samples]
)
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(
original_samples_embedding[:, 0],
original_samples_embedding[:, 1],
color="black",
marker=".",
label="original",
)
ax.scatter(
generated_samples_embedding[:, 0],
generated_samples_embedding[:, 1],
color="red",
marker="x",
label="generated",
)
ax.set_title("t-SNE of original and generated samples")
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
Comparing Time Series with Each Other¶
Time series data involves a time dimension, and it is not intuitive to see the difference between two time series. In this notebook, we will show how to compare time series with each other.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from dtaidistance import dtw
from dtaidistance import dtw_visualisation as dtwvis
sns.set_theme()
plt.rcParams.update(
{
"font.size": 18, # General font size
"axes.titlesize": 20, # Title font size
"axes.labelsize": 16, # Axis label font size
"xtick.labelsize": 14, # X-axis tick label font size
"ytick.labelsize": 14, # Y-axis tick label font size
"legend.fontsize": 14, # Legend font size
"figure.titlesize": 20, # Figure title font size
}
)
DTW¶
To illustrate how DTW can be used to compare time series, we will use the following synthetic series:
t = np.arange(0, 20, 0.1)
ts_original = np.sin(t)
We apply different transformations to the original time series.
ts_shifted = np.roll(ts_original, 10)
ts_jitter = ts_original + np.random.normal(0, 0.1, len(ts_original))
ts_flipped = ts_original[::-1]
ts_shortened = ts_original[::2]
ts_raise_level = ts_original + 0.5
ts_outlier = ts_original + np.append(np.zeros(len(ts_original) - 1), [10])
df = pd.DataFrame(
{
"t": t,
"original": ts_original,
"shifted": ts_shifted,
"jitter": ts_jitter,
"flipped": ts_flipped,
"shortened": np.pad(
ts_shortened, (0, len(ts_original) - len(ts_shortened)), constant_values=0
),
"raise_level": ts_raise_level,
"outlier": ts_outlier,
}
)
_, ax = plt.subplots()
for s in df.columns[1:]:
sns.lineplot(df, x="t", y=s, ax=ax, label=s)
distances = {
"series": df.columns[1:],
}
for s in df.columns[1:]:
distances["dtw"] = distances.get("dtw", []) + [dtw.distance(df.original, df[s])]
distances["euclidean"] = distances.get("euclidean", []) + [
np.linalg.norm(df.original - df[s])
]
_, ax = plt.subplots(figsize=(10, 6.18 * 2), nrows=2)
pd.DataFrame(distances).set_index("series").plot.bar(ax=ax[0])
colors = sns.color_palette("husl", len(distances["series"]))
pd.DataFrame(distances).plot.scatter(x="dtw", y="euclidean", ax=ax[1], c=colors, s=100)
for i, txt in enumerate(distances["series"]):
ax[1].annotate(txt, (distances["dtw"][i], distances["euclidean"][i]), fontsize=12)
ax[1].legend(distances["series"], loc="best")
def dtw_map(s1, s2, window=None):
if window is None:
window = len(s1)
d, paths = dtw.warping_paths(s1, s2, window=window, psi=2)
best_path = dtw.best_path(paths)
return dtwvis.plot_warpingpaths(s1, s2, paths, best_path)
dtw_map(df.original, df.jitter)
for s in df.columns[1:]:
fig, ax = dtw_map(df.original, df[s])
fig.suptitle(s, y=1.05)
Dimension Reduction¶
We embed the original time series into a time-delayed embedding space, then reduce the dimensionality of the embedded time series for visualizations.
def time_delay_embed(df: pd.DataFrame, window_size: int) -> pd.DataFrame:
"""embed time series into a time delay embedding space
Time column `t` is required in the input data frame.
:param df: original time series data frame
:param window_size: window size for the time delay embedding
"""
dfs_embedded = []
for i in df.rolling(window_size):
i_t = i.t.iloc[0]
dfs_embedded.append(
pd.DataFrame(i.reset_index(drop=True))
.drop(columns=["t"])
.T.reset_index()
.rename(columns={"index": "name"})
.assign(t=i_t)
)
df_embedded = pd.concat(dfs_embedded[window_size - 1 :])
return df_embedded
df_embedded_2 = time_delay_embed(df, window_size=2)
_, ax = plt.subplots()
(
df_embedded_2.loc[df_embedded_2.name == "original"].plot.line(
x=0, y=1, ax=ax, legend=False
)
)
(
df_embedded_2.loc[df_embedded_2.name == "original"].plot.scatter(
x=0, y=1, c="t", colormap="viridis", ax=ax
)
)
We choose a larger window size to capture longer time dependencies and to give the dimension reduction methods enough dimensions to work with.
df_embedded = time_delay_embed(df, window_size=5)
PCA¶
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
df_embedded_pca = pd.concat(
[
pd.DataFrame(
pca.fit_transform(
df_embedded.loc[df_embedded.name == n].drop(columns=["name", "t"])
),
columns=["pca_0", "pca_1"],
).assign(name=n)
for n in df_embedded.name.unique()
]
)
sns.scatterplot(data=df_embedded_pca, x="pca_0", y="pca_1", hue="name")
_, ax = plt.subplots()
sns.scatterplot(
data=df_embedded_pca.loc[
df_embedded_pca.name.isin(
["original", "jitter", "flipped", "raise_level", "shifted", "shortened"]
)
],
x="pca_0",
y="pca_1",
hue="name",
style="name",
ax=ax,
)
ax.legend(loc="lower center", bbox_to_anchor=(0.5, -0.4), ncol=3)
t-SNE¶
from sklearn.manifold import TSNE
t_sne = TSNE(n_components=2, learning_rate="auto", init="random", perplexity=3)
df_embedded.name.unique()
df_embedded_tsne = pd.concat(
[
pd.DataFrame(
t_sne.fit_transform(
df_embedded.loc[df_embedded.name == n].drop(columns=["name", "t"])
),
columns=["tsne_0", "tsne_1"],
).assign(name=n)
for n in df_embedded.name.unique()
]
)
df_embedded_tsne.loc[df_embedded_tsne.name == "original"]
sns.scatterplot(data=df_embedded_tsne, x="tsne_0", y="tsne_1", hue="name")
sns.scatterplot(
data=df_embedded_tsne.loc[
df_embedded_tsne.name.isin(["original", "jitter", "outlier"])
],
x="tsne_0",
y="tsne_1",
hue="name",
)
import dataclasses
from functools import cached_property
import numpy as np
import pandas as pd
import plotly.express as px
import torch
import torch.nn as nn
from loguru import logger
@dataclasses.dataclass
class DiffusionPocessParams:
"""Parameter that defines a diffusion process.
:param steps: Number of steps in the diffusion process.
:param beta: Beta parameter for the diffusion process.
"""
steps: int
beta: float
@cached_property
def alpha(self) -> float:
r"""$\alpha = 1 - \beta$"""
return 1.0 - self.beta
@cached_property
def beta_by_step(self) -> np.ndarray:
"""the beta parameter for each step
in the diffusion process.
"""
return np.array([self.beta] * self.steps)
@cached_property
def alpha_by_step(self) -> np.ndarray:
"""the alpha parameter for each step
in the diffusion process."""
return np.array([self.alpha] * self.steps)
class DiffusionProcess:
"""
Diffusion process.
:param params: DiffusionParams that defines
how the diffusion process works
:param noise: noise tensor,
shape is (batch_size, params.steps)
"""
def __init__(
self,
params: DiffusionPocessParams,
noise: torch.Tensor,
dtype: torch.dtype = torch.float32,
):
self.params = params
self.noise = noise
self.dtype = dtype
@cached_property
def alpha_by_step(self) -> torch.Tensor:
"""The alpha parameter for each step
in the diffusion process.
"""
return torch.tensor(self.params.alpha_by_step, dtype=self.dtype)
def _forward_process_by_step(self, state: torch.Tensor, step: int) -> torch.Tensor:
r"""Assuming that we know
the noise at step $t$,
$$
x(t) = \sqrt{\alpha(t)}x(t-1)
+ \sqrt{1 - \alpha(t)}\epsilon(t)
$$
:param state: The state at step $t-1$.
:param step: The current step $t$.
:return: The state at step $t$.
"""
return (
torch.sqrt(self.alpha_by_step[step]) * state
+ torch.sqrt(1 - self.alpha_by_step[step]) * self.noise[:, step]
)
def _inverse_process_by_step(self, state: torch.Tensor, step: int) -> torch.Tensor:
r"""Assuming that we know
the noise at step $t$,
$$
x(t-1) = \frac{1}{\sqrt{\alpha(t)}}
(x(t) - \sqrt{1 - \alpha(t)}\epsilon(t))
$$
"""
return (
state - torch.sqrt(1 - self.alpha_by_step[step]) * self.noise[:, step]
) / torch.sqrt(self.alpha_by_step[step])
def gaussian_noise(n_var: int, length: int) -> torch.Tensor:
"""Generate a Gaussian noise tensor.
:param n_var: Number of variables.
:param length: Length of the tensor.
"""
return torch.normal(mean=0, std=1, size=(n_var, length))
diffusion_process_params = DiffusionPocessParams(
steps=100,
beta=0.005,
# beta=0,
)
diffusion_batch_size = 1000
# diffusion_batch_size = 2
noise = gaussian_noise(diffusion_batch_size, diffusion_process_params.steps)
diffusion_process = DiffusionProcess(diffusion_process_params, noise=noise)
# diffusion_initial_x = torch.sin(
# torch.linspace(0, 1, diffusion_batch_size)
# .reshape(diffusion_batch_size)
# )
diffusion_initial_x = torch.rand(diffusion_batch_size)
# diffusion_initial_x = (
# torch.distributions.Beta(torch.tensor([0.5]), torch.tensor([0.5]))
# .sample((diffusion_batch_size, 1))
# .reshape(diffusion_batch_size)
# )
diffusion_initial_x
Forward process step by step¶
diffusion_steps_step_by_step = [diffusion_initial_x.detach().numpy()]
for i in range(0, diffusion_process_params.steps):
logger.info(f"step {i}")
i_state = (
diffusion_process._forward_process_by_step(
torch.from_numpy(diffusion_steps_step_by_step[-1]), step=i
)
.detach()
.numpy()
)
logger.info(f"i_state {i_state[:2]}")
diffusion_steps_step_by_step.append(i_state)
px.histogram(diffusion_initial_x)
px.histogram(diffusion_steps_step_by_step[0])
px.histogram(diffusion_steps_step_by_step[-1])
Reverse step by step¶
diffusion_steps_reverse = [diffusion_steps_step_by_step[-1]]
for i in range(diffusion_process_params.steps - 1, -1, -1):
logger.info(f"step {i}")
i_state = (
diffusion_process._inverse_process_by_step(
torch.from_numpy(diffusion_steps_reverse[-1]), step=i
)
.detach()
.numpy()
)
logger.info(f"i_state {i_state[:2]}")
diffusion_steps_reverse.append(i_state)
px.histogram(diffusion_steps_reverse[0])
px.histogram(diffusion_steps_reverse[-1])
Diffusion Distributions¶
df_diffusion_example = pd.DataFrame(
{i: v for i, v in enumerate(diffusion_steps_step_by_step)}
).T
df_diffusion_example["step"] = df_diffusion_example.index
df_diffusion_example_melted = df_diffusion_example.melt(
id_vars=["step"], var_name="variable", value_name="value"
)
df_diffusion_example_melted.tail()
px.histogram(
df_diffusion_example_melted,
x="value",
histnorm="probability density",
animation_frame="step",
)
px.violin(
df_diffusion_example_melted.loc[
df_diffusion_example_melted["step"].isin(
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
)
],
x="step",
y="value",
)
px.line(
df_diffusion_example_melted,
x="step",
y="value",
color="variable",
)
Create Visuals¶
import matplotlib.pyplot as plt
import seaborn as sns
_, ax = plt.subplots(figsize=(10, 6))
sns.histplot(
df_diffusion_example_melted.loc[df_diffusion_example_melted["step"] == 0],
x="value",
stat="probability",
color="k",
label="Initial Distribution",
ax=ax,
)
ax.set_title("Initial Distribution")
ax.set_xlabel("Position")
_, ax = plt.subplots(figsize=(10, 6))
sns.histplot(
df_diffusion_example_melted.loc[
df_diffusion_example_melted["step"] == max(df_diffusion_example_melted["step"])
],
x="value",
stat="probability",
color="k",
ax=ax,
)
ax.set_title("Final Distribution")
ax.set_xlabel("Position")
df_diffusion_example_melted
_, ax = plt.subplots(figsize=(15, 6))
sns.lineplot(
df_diffusion_example_melted.loc[
df_diffusion_example_melted["variable"].isin(
[0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
)
].rename(columns={"variable": "particle id"}),
x="step",
y="value",
style="particle id",
)
ax.set_xlabel("Time Step")
ax.set_ylabel("Particle Position")
ridge_steps = 10
fig, ax = plt.subplots(figsize=(8, 6))
colors = plt.cm.viridis(np.linspace(0, 1, ridge_steps))
bin_edges = np.histogram_bin_edges(df_diffusion_example_melted["value"], bins="auto")
for i, step in enumerate([0, 10, 20, 30, 40, 50, 60, 70, 80, 90]):
values = df_diffusion_example_melted[df_diffusion_example_melted["step"] == step][
"value"
]
counts, _ = np.histogram(values, bins=bin_edges)
counts = counts / counts.max()
offset = i * 1.2
ax.fill_between(
bin_edges[:-1], offset, counts + offset, step="mid", color=colors[i], alpha=0.7
)
ax.text(bin_edges[-1] + 0.2, offset, step, va="center")
ax.axes.get_yaxis().set_visible(False)
plt.axis("off")
ax.set_xlabel("Position")
plt.tight_layout()
Model¶
We create a naive model based on the idea of diffusion.
- We connect the real data to the latent space through a diffusion process.
- We forecast in the latent space.
import lightning.pytorch as pl
from lightning import LightningModule
@dataclasses.dataclass
class LatentRNNParams:
"""Parameters for Diffusion process.
:param latent_size: latent space dimension
:param history_length: input sequence length
:param n_features: number of features
"""
history_length: int
latent_size: int = 100
num_layers: int = 2
n_features: int = 1
initial_state: torch.Tensor = None
@cached_property
def data_size(self) -> int:
"""The dimension of the input data
when flattened.
"""
return self.history_length * self.n_features
def asdict(self) -> dict:
return dataclasses.asdict(self)
class LatentRNN(nn.Module):
"""Forecasting the next step in latent space."""
def __init__(self, params: LatentRNNParams):
super().__init__()
self.params = params
self.hparams = params.asdict()
self.rnn = nn.GRU(
input_size=self.params.history_length,
hidden_size=self.params.latent_size,
num_layers=self.params.num_layers,
batch_first=True,
)
def forward(
self, x: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
:param x: input data, shape (batch_size, history_length * n_features)
"""
outputs, _ = self.rnn(x, self.params.initial_state)
return outputs
class DiffusionEncoder(nn.Module):
"""Encode the time series into the latent space."""
def __init__(
self,
params: DiffusionPocessParams,
noise: torch.Tensor,
):
super().__init__()
self.params = params
self.noise = noise
@staticmethod
def _forward_process_by_step(
state: torch.Tensor, alpha_by_step: torch.Tensor, noise: torch.Tensor, step: int
) -> torch.Tensor:
r"""Assuming that we know the noise at step $t$,
$$
x(t) = \sqrt{\alpha(t)}x(t-1) + \sqrt{1 - \alpha(t)}\epsilon(t)
$$
"""
batch_size = state.shape[0]
return torch.sqrt(alpha_by_step[step]) * state + (
torch.sqrt(1 - alpha_by_step[step]) * noise[:batch_size, step]
).reshape(batch_size, 1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Encoding the latent space into a distribution.
:param x: input data, shape (batch_size, history_length, n_features)
"""
alpha_by_step = torch.tensor(self.params.alpha_by_step).to(x)
self.noise = self.noise.to(x)
# logger.debug(
# f"alpha_by_step: {alpha_by_step.shape}"
# f"noise: {self.noise.shape}"
# f"x: {x.shape}"
# )
diffusion_steps_step_by_step = [x]
for i in range(0, self.params.steps):
i_state = self._forward_process_by_step(
diffusion_steps_step_by_step[-1],
alpha_by_step=alpha_by_step,
noise=self.noise,
step=i,
)
diffusion_steps_step_by_step.append(i_state)
return diffusion_steps_step_by_step[-1]
class DiffusionDecoder(nn.Module):
"""Decode the latent space into a distribution."""
def __init__(
self,
params: DiffusionPocessParams,
noise: torch.Tensor,
):
super().__init__()
self.params = params
self.noise = noise
@staticmethod
def _inverse_process_by_step(
state: torch.Tensor, alpha_by_step: torch.Tensor, noise: torch.Tensor, step: int
) -> torch.Tensor:
r"""Assuming that we know the noise at step $t$,
$$
x(t-1) = \frac{1}{\sqrt{\alpha(t)}}
(x(t) - \sqrt{1 - \alpha(t)}\epsilon(t))
$$
"""
batch_size = state.shape[0]
return (
state
- (torch.sqrt(1 - alpha_by_step[step]) * noise[:batch_size, step]).reshape(
batch_size, 1
)
) / torch.sqrt(alpha_by_step[step])
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Encoding the latent space into a distribution.
:param x: input data, shape (batch_size, history_length, n_features)
"""
alpha_by_step = torch.tensor(self.params.alpha_by_step).to(x)
self.noise = self.noise.to(x)
diffusion_steps_reverse = [x]
for i in range(self.params.steps - 1, -1, -1):
i_state = self._inverse_process_by_step(
state=diffusion_steps_reverse[-1],
alpha_by_step=alpha_by_step,
noise=self.noise,
step=i,
)
diffusion_steps_reverse.append(i_state)
return diffusion_steps_reverse[-1]
class NaiveDiffusionModel(nn.Module):
"""A naive diffusion model that explicitly calculates
the diffusion process.
"""
def __init__(
self,
rnn: LatentRNN,
diffusion_decoder: DiffusionDecoder,
diffusion_encoder: DiffusionEncoder,
horizon: int = 1,
):
super().__init__()
self.rnn = rnn
self.diffusion_decoder = diffusion_decoder
self.diffusion_encoder = diffusion_encoder
self.horizon = horizon
self.scale = nn.Linear(
in_features=self.rnn.params.latent_size,
out_features=self.horizon,
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# logger.debug(f"x.squeeze(-1): {x.squeeze(-1).shape=}")
x_latent = self.diffusion_encoder(x.squeeze(-1))
# logger.debug(f"x_latent: {x_latent.shape=}")
y_latent = self.rnn(x_latent)
# logger.debug(f"y_latent: {y_latent.shape=}")
y_hat = self.diffusion_decoder(y_latent)
# logger.debug(f"y_hat: {y_hat.shape=}")
y_hat = self.scale(y_hat)
# logger.debug(f"scaled y_hat: {y_hat.shape=}")
return y_hat
class NaiveDiffusionForecaster(LightningModule):
"""A assembled lightning module for the naive diffusion model."""
def __init__(
self,
model: NaiveDiffusionModel,
loss: nn.Module = nn.MSELoss(),
):
super().__init__()
self.model = model
self.loss = loss
def configure_optimizers(self):
optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
x = x.type(self.dtype)
y = y.type(self.dtype)
batch_size = x.shape[0]
y_hat = self.model(x)[:batch_size, :].reshape_as(y)
loss = self.loss(y_hat, y).mean()
self.log_dict({"train_loss": loss}, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
x = x.type(self.dtype)
y = y.type(self.dtype)
batch_size = x.shape[0]
y_hat = self.model(x)[:batch_size, :].reshape_as(y)
loss = self.loss(y_hat, y).mean()
self.log_dict({"val_loss": loss}, prog_bar=True)
return loss
def predict_step(self, batch, batch_idx):
x, y = batch
x = x.type(self.dtype)
y = y.type(self.dtype)
batch_size = x.shape[0]
y_hat = self.model(x)[:batch_size, :].reshape_as(y)
return x, y_hat
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = x.to(self.model.rnn.rnn.weight_ih_l0)
return self.model(x)
df = pd.DataFrame(
{"t": np.linspace(0, 100, 501), "y": np.sin(np.linspace(0, 100, 501))}
)
_, ax = plt.subplots(figsize=(10, 6.18))
df.plot(x="t", y="y", ax=ax)
Training¶
from ts_bolt.datamodules.pandas import DataFrameDataModule
history_length_1_step = 100
horizon_1_step = 1
training_batch_size = 64
training_noise = gaussian_noise(training_batch_size, diffusion_process_params.steps)
diffusion_process_params.alpha_by_step.shape, training_noise.shape
test_state = torch.rand(training_batch_size, diffusion_process_params.steps)
test_state.shape
torch.sqrt(torch.from_numpy(diffusion_process_params.alpha_by_step)[0])
(
test_state
- torch.sqrt(torch.from_numpy(diffusion_process_params.alpha_by_step)[0])
* training_noise[:, 0].reshape(training_batch_size, 1)
).shape
pdm_1_step = DataFrameDataModule(
history_length=history_length_1_step,
horizon=horizon_1_step,
dataframe=df[["y"]].astype(np.float32),
batch_size=training_batch_size,
)
diffusion_decoder = DiffusionDecoder(diffusion_process_params, training_noise)
diffusion_encoder = DiffusionEncoder(diffusion_process_params, training_noise)
latent_rnn_params = LatentRNNParams(
history_length=history_length_1_step,
latent_size=diffusion_process_params.steps,
)
latent_rnn = LatentRNN(latent_rnn_params)
naive_diffusion_model = NaiveDiffusionModel(
rnn=latent_rnn,
diffusion_decoder=diffusion_decoder,
diffusion_encoder=diffusion_encoder,
)
naive_diffusion_forecaster = NaiveDiffusionForecaster(
model=naive_diffusion_model.float(),
)
naive_diffusion_forecaster
logger_1_step = pl.loggers.TensorBoardLogger(
save_dir="lightning_logs", name="naive_diffusion_ts_1_step"
)
precision = "64"
trainer_1_step = pl.Trainer(
# precision="32",
precision=precision,
# max_epochs=5000,
max_epochs=10000,
min_epochs=5,
# callbacks=[
# pl.callbacks.early_stopping.EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-8, patience=4)
# ],
logger=logger_1_step,
# accelerator="mps",
accelerator="cuda",
)
trainer_1_step.fit(model=naive_diffusion_forecaster, datamodule=pdm_1_step)
Evaluation¶
from typing import Dict, List, Sequence, Tuple
import matplotlib as mpl
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from torchmetrics import MetricCollection
from torchmetrics.regression import (
MeanAbsoluteError,
MeanAbsolutePercentageError,
MeanSquaredError,
SymmetricMeanAbsolutePercentageError,
)
from ts_bolt.evaluation.evaluator import Evaluator
from ts_bolt.naive_forecasters.last_observation import LastObservationForecaster
class Evaluator:
"""Evaluate the predictions
:param step: which prediction step to be evaluated.
:param gap: gap between input history and target/prediction.
"""
def __init__(self, step: int = 0, gap: int = 0):
self.step = step
self.gap = gap
@staticmethod
def get_one_history(
predictions: Sequence[Sequence], idx: int, batch_idx: int = 0
) -> torch.Tensor:
return predictions[batch_idx][0][idx, ...]
@staticmethod
def get_one_pred(predictions: List, idx: int, batch_idx: int = 0) -> torch.Tensor:
return predictions[batch_idx][1][idx, ...]
@staticmethod
def get_y(predictions: List, step: int) -> List[torch.Tensor]:
return [i[1][..., step] for i in predictions]
def y(self, predictions: List, batch_idx: int = 0) -> torch.Tensor:
return self.get_y(predictions, self.step)[batch_idx].detach()
@staticmethod
def get_y_true(dataloader: DataLoader, step: int) -> list[torch.Tensor]:
return [i[1][..., step] for i in dataloader]
def y_true(self, dataloader: DataLoader, batch_idx: int = 0) -> torch.Tensor:
return self.get_y_true(dataloader, step=self.step)[batch_idx].detach()
def get_one_sample(
self, predictions: List, idx: int, batch_idx: int = 0
) -> Tuple[torch.Tensor, torch.Tensor]:
return (
self.get_one_history(predictions, idx, batch_idx),
self.get_one_pred(predictions, idx, batch_idx),
)
def plot_one_sample(
self, ax: mpl.axes.Axes, predictions: List, idx: int, batch_idx: int = 0
) -> None:
history, pred = self.get_one_sample(predictions, idx, batch_idx)
x_raw = np.arange(len(history) + len(pred) + self.gap)
x_history = x_raw[: len(history)]
x_pred = x_raw[len(history) + self.gap :]
x = np.concatenate([x_history, x_pred])
y = np.concatenate([history, pred])
ax.plot(x, y, marker=".", label=f"input ({idx})")
ax.axvspan(x_pred[0], x_pred[-1], color="orange", alpha=0.1)
@property
def metric_collection(self) -> MetricCollection:
return MetricCollection(
MeanAbsoluteError(),
MeanAbsolutePercentageError(),
MeanSquaredError(),
SymmetricMeanAbsolutePercentageError(),
)
@staticmethod
def metric_dataframe(metrics: Dict) -> pd.DataFrame:
return pd.DataFrame(
[{k: float(v) for k, v in metrics.items()}], index=["values"]
).T
def metrics(
self, predictions: List, dataloader: DataLoader, batch_idx: int = 0
) -> pd.DataFrame:
truths = self.y_true(dataloader)
preds = self.y(predictions, batch_idx=batch_idx)
return self.metric_dataframe(self.metric_collection(preds, truths))
evaluator_1_step = Evaluator(step=0)
predictions_1_step = trainer_1_step.predict(
model=naive_diffusion_forecaster, datamodule=pdm_1_step
)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
# ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
trainer_naive_1_step = pl.Trainer(precision=precision)
lobs_forecaster_1_step = LastObservationForecaster(horizon=horizon_1_step)
lobs_1_step_predictions = trainer_naive_1_step.predict(
model=lobs_forecaster_1_step, datamodule=pdm_1_step
)
fig, ax = plt.subplots(figsize=(10, 6.18))
ax.plot(
evaluator_1_step.y_true(dataloader=pdm_1_step.predict_dataloader()),
"g-",
label="truth",
)
ax.plot(evaluator_1_step.y(predictions_1_step), "r--", label="predictions")
ax.plot(evaluator_1_step.y(lobs_1_step_predictions), "b-.", label="naive predictions")
plt.legend()
evaluator_1_step.metrics(predictions_1_step, pdm_1_step.predict_dataloader())
evaluator_1_step.metrics(
[[i.unsqueeze(-1) for i in lobs_1_step_predictions[0]]],
pdm_1_step.predict_dataloader(),
)
Time Series Data and Embeddings¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
def plot_arrow_chart(
dataframe: pd.DataFrame,
x_col: str,
y_col: str,
ax: plt.Axes,
color: str = "k",
alpha: float = 0.7,
marker: str = ".",
linestyle: str = "-",
arrow_head_width: int = 4000,
) -> plt.Axes:
"""
Plot an arrow chart of y_col against x_col,
with arrows indicating the direction of time.
"""
x = dataframe[x_col].values
y = dataframe[y_col].values
ax.plot(x, y, marker=marker, linestyle=linestyle, color=color, alpha=alpha)
step = max(1, len(x) // 100)
for i in range(0, len(x) - 1, step):
ax.arrow(
x[i],
y[i],
x[i + 1] - x[i],
y[i + 1] - y[i],
shape="full",
lw=0,
length_includes_head=True,
head_width=arrow_head_width,
color=color,
alpha=alpha,
)
return ax
Pendulum¶
from ts_dl_utils.datasets.pendulum import Pendulum
pen = Pendulum(length=20)
df_pen = pd.DataFrame(pen(3, 100, initial_angle=1, beta=0.01))
df_pen["theta_1"] = df_pen["theta"].shift()
df_pen["theta_diff"] = df_pen["theta"].diff()
df_pen
fig = plt.figure(figsize=(10, 8), layout="constrained")
spec = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(spec[0, :])
ax10 = fig.add_subplot(spec[1, 0])
ax11 = fig.add_subplot(spec[1, 1])
ax0.plot(
df_pen.t,
df_pen.theta,
marker=".",
linestyle="-",
color="k",
)
# Make x-ticks readable
ax0.xaxis.set_major_locator(plt.MaxNLocator(8))
# fig.autofmt_xdate(rotation=30)
ax0.set_title("Swing Angle")
ax10 = plot_arrow_chart(
df_pen, x_col="theta", y_col="theta_1", ax=ax10, arrow_head_width=0.00001
)
ax10.set_xlabel("Swing Angle")
ax10.set_ylabel("Swing Angle 0.05 seconds ago")
ax10.set_title("Swing Angle and Angle 0.05 seconds ago")
ax11 = plot_arrow_chart(
df_pen, x_col="theta", y_col="theta_diff", ax=ax11, arrow_head_width=0.00001
)
ax11.set_xlabel("Swing Angle")
ax11.set_ylabel("Swing Angle Change Rate")
ax11.set_title("Phase Portrait")
plt.tight_layout()
Covid¶
df_ecdc_covid = pd.read_csv(
"https://gist.githubusercontent.com/emptymalei/"
"90869e811b4aa118a7d28a5944587a64/raw"
"/1534670c8a3859ab3a6ae8e9ead6795248a3e664"
"/ecdc%2520covid%252019%2520data"
)
px.line(df_ecdc_covid, x="datetime", y="Total")
df_ecdc_covid
df_ecdc_covid_arrow_chart = df_ecdc_covid.loc[
pd.to_datetime(df_ecdc_covid.datetime).between("2020-08-01", "2020-12-01")
].copy()
df_ecdc_covid_arrow_chart["Total_1"] = df_ecdc_covid_arrow_chart["Total"].shift()
df_ecdc_covid_arrow_chart["Total_diff"] = df_ecdc_covid_arrow_chart["Total"].diff()
fig = plt.figure(figsize=(10, 8), layout="constrained")
spec = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(spec[0, :])
ax10 = fig.add_subplot(spec[1, 0])
ax11 = fig.add_subplot(spec[1, 1])
ax0.plot(
df_ecdc_covid_arrow_chart.datetime,
df_ecdc_covid_arrow_chart.Total,
marker=".",
linestyle="-",
color="k",
)
# Make x-ticks readable
ax0.xaxis.set_major_locator(plt.MaxNLocator(8))
# fig.autofmt_xdate(rotation=30)
ax0.set_title("Covid Cases in EU Over Time")
ax10 = plot_arrow_chart(
df_ecdc_covid_arrow_chart, x_col="Total", y_col="Total_1", ax=ax10
)
ax10.set_xlabel("Total Cases")
ax10.set_ylabel("Total Cases Lagged by 1 Day")
ax10.set_title("Covid Cases and Lagged Values")
ax11 = plot_arrow_chart(
df_ecdc_covid_arrow_chart, x_col="Total", y_col="Total_diff", ax=ax11
)
ax11.set_xlabel("Total Cases")
ax11.set_ylabel("Total Cases Change")
ax11.set_title("Covid Cases in EU Phase Portrait")
ax11.set_ylim(-100_000, 100_000)
plt.tight_layout()
Walmart¶
df_walmart = pd.read_csv(
"https://raw.githubusercontent.com/datumorphism/"
"dataset-m5-simplified/refs/heads/main/dataset/"
"m5_store_sales.csv"
)
df_walmart
px.line(df_walmart, x="date", y="CA")
df_walmart_total = df_walmart[["date", "CA", "TX", "WI"]].copy()
df_walmart_total["total"] = (
df_walmart_total.CA + df_walmart_total.TX + df_walmart_total.WI
)
df_walmart_total["datetime"] = pd.to_datetime(df_walmart_total.date, format="%Y-%m-%d")
df_walmart_total["timestamp"] = df_walmart_total.datetime.astype(int) // 10**9
df_walmart_total["total_1"] = df_walmart_total.total.shift()
df_walmart_total["total_diff"] = df_walmart_total.total.diff()
px.scatter(
df_walmart_total.loc[pd.to_datetime(df_walmart_total.date).dt.year == 2016],
x="total",
y="total_1",
color="timestamp",
)
df_walmart_arrow_chart = df_walmart_total.loc[
pd.to_datetime(df_walmart_total.date).between("2016-01-01", "2016-03-01")
].copy()
fig = plt.figure(figsize=(10, 8), layout="constrained")
spec = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(spec[0, :])
ax10 = fig.add_subplot(spec[1, 0])
ax11 = fig.add_subplot(spec[1, 1])
ax0.plot(
df_walmart_arrow_chart.datetime,
df_walmart_arrow_chart.total,
marker=".",
linestyle="-",
color="k",
)
# Make x-ticks readable
ax0.xaxis.set_major_locator(plt.MaxNLocator(8))
# fig.autofmt_xdate(rotation=30)
ax0.set_title("Walmart Sales Over Time")
ax10 = plot_arrow_chart(
df_walmart_arrow_chart,
x_col="total",
y_col="total_1",
ax=ax10,
arrow_head_width=500,
)
ax10.set_xlabel("Total Sales")
ax10.set_ylabel("Total Sales Lagged by 1 Day")
ax10.set_title("Walmart Sales and Lagged Sales")
ax11 = plot_arrow_chart(
df_walmart_arrow_chart,
x_col="total",
y_col="total_diff",
ax=ax11,
arrow_head_width=500,
)
ax11.set_xlabel("Total Sales")
ax11.set_ylabel("Total Sales Change")
ax11.set_title("Walmart Sales Phase Portrait")
plt.tight_layout()
Electricity Data¶
import io
import zipfile
import pandas as pd
import requests
# Download from remote URL
data_uri = "https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip"
r = requests.get(data_uri)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("tmp/data/uci_electricity/")
# Load as pandas dataframe
df_electricity = (
pd.read_csv("tmp/data/uci_electricity/LD2011_2014.txt", delimiter=";", decimal=",")
.rename(columns={"Unnamed: 0": "date"})
.set_index("date")
)
df_electricity.index = pd.to_datetime(df_electricity.index)
df_electricity
df_electricity.loc[
(df_electricity.index >= "2012-01-01") & (df_electricity.index < "2012-02-01")
][["MT_001"]].plot()
Small Yet Powerful Concepts¶
Appendices¶
Many formulae and concepts recur across different models. In this chapter, we provide supporting material for such concepts.
Entropy¶
Shannon Entropy¶
Shannon entropy \(S\) is the expectation of information content \(I(X)=-\log \left(p\right)\)1,
$$
S = \mathbb{E}_p\left[-\log p\right] = -\sum_i p_i \log p_i.
$$
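The definition can be sketched numerically for a discrete distribution (a minimal NumPy example; the function name is ours):

```python
import numpy as np

def shannon_entropy(p) -> float:
    """Shannon entropy S = -sum_i p_i log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 log 0 is taken to be 0
    return float(-np.sum(p * np.log(p)))

# A fair coin carries the most uncertainty among binary distributions
print(shannon_entropy([0.5, 0.5]))  # log(2) ≈ 0.693
print(shannon_entropy([0.9, 0.1]))  # ≈ 0.325
```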
Cross Entropy¶
Cross entropy is2
$$
H(p, q) = -\sum_i p_i \log q_i.
$$
Cross entropy \(H(p, q)\) can also be decomposed,
$$
H(p, q) = H(p) + \operatorname{D}_{\mathrm{KL}}\left(p \parallel q\right),
$$
where \(H(p)\) is the entropy of \(p\) and \(\operatorname{D}_{\mathrm{KL}}\) is the KL Divergence.
Cross entropy is widely used in classification problems, e.g., logistic regression.
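The decomposition above can be checked numerically for discrete distributions (a NumPy sketch; the helper names are ours):

```python
import numpy as np

def entropy(p) -> float:
    """H(p) = -sum_i p_i log(p_i)"""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q) -> float:
    """H(p, q) = -sum_i p_i log(q_i)"""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q) -> float:
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)"""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])
# decomposition: H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```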
-
Contributors to Wikimedia projects. Entropy (information theory). In: Wikipedia [Internet]. 29 Aug 2021 [cited 4 Sep 2021]. Available: https://en.wikipedia.org/wiki/Entropy_(information_theory) ↩
-
Contributors to Wikimedia projects. Cross entropy. In: Wikipedia [Internet]. 4 Jul 2021 [cited 4 Sep 2021]. Available: https://en.wikipedia.org/wiki/Cross_entropy ↩
Mutual Information¶
Mutual information is
$$
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.
$$
Mutual information is closely related to KL divergence,
$$
I(X; Y) = \operatorname{D}_{\mathrm{KL}}\left(p(x, y) \parallel p(x) p(y)\right).
$$
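For discrete variables, this relation can be computed directly from a joint probability table (a NumPy sketch; the function name is ours):

```python
import numpy as np

def mutual_information(p_xy: np.ndarray) -> float:
    """I(X; Y) = D_KL( p(x, y) || p(x) p(y) ) for a discrete joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # 0 log 0 is taken to be 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

# independent variables carry zero mutual information
p_indep = np.outer([0.4, 0.6], [0.3, 0.7])
print(mutual_information(p_indep))  # ≈ 0.0

# perfectly correlated variables: I(X; Y) = H(X) = log(2)
p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(p_corr))  # ≈ 0.693
```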
KL Divergence¶
The Kullback–Leibler (KL) divergence is defined as
$$
\operatorname{D}_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \,\mathrm{d}x.
$$
Suppose \(p\) is a Gaussian distribution and \(q\) is a bimodal Gaussian mixture; the divergences \(\operatorname{D}_\mathrm{KL}(p \parallel q )\) and \(\operatorname{D}_\mathrm{KL}(q \parallel p )\) differ, since the KL divergence is not necessarily symmetric. Thus the KL divergence is not a proper distance metric.

KL divergence is a special case of f-divergence.
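The asymmetry can be seen with a small numerical example for discrete distributions (a NumPy sketch; the values are made up):

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D_KL(p || q) = sum_i p_i log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])
# the two directions generally give different values
print(kl_divergence(p, q), kl_divergence(q, p))
```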
f-Divergence¶
The f-divergence is defined as1
$$
\operatorname{D}_{f}(p \parallel q) = \int f\left(\frac{p}{q}\right) q \,\mathrm{d}\mu,
$$
where \(p\) and \(q\) are two densities and \(\mu\) is a reference distribution.
Requirements on the generating function
The generating function \(f\) is required to
- be convex, and
- \(f(1) =0\).
For \(f(x) = x \log x\) with \(x=p/q\), the f-divergence reduces to the KL divergence,
$$
\operatorname{D}_{f}(p \parallel q) = \int \frac{p}{q} \log\left(\frac{p}{q}\right) q \,\mathrm{d}\mu = \int p \log \frac{p}{q} \,\mathrm{d}\mu = \operatorname{D}_{\mathrm{KL}}(p \parallel q).
$$
For more special cases of f-divergence, please refer to Wikipedia1. Nowozin et al. 2016 also provide a concise review of f-divergence2.
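The reduction to KL divergence can be checked numerically for discrete distributions (a NumPy sketch; the chi-squared case is an extra illustration of another generating function):

```python
import numpy as np

def f_divergence(p, q, f) -> float:
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

# f(t) = t log t recovers the KL divergence D_KL(p || q)
kl_from_f = f_divergence(p, q, lambda t: t * np.log(t))
kl_direct = float(np.sum(p * np.log(p / q)))
assert np.isclose(kl_from_f, kl_direct)

# f(t) = (t - 1)^2 gives the Pearson chi-squared divergence
chi2 = f_divergence(p, q, lambda t: (t - 1) ** 2)
```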
-
Contributors to Wikimedia projects. F-divergence. In: Wikipedia [Internet]. 17 Jul 2021 [cited 4 Sep 2021]. Available: https://en.wikipedia.org/wiki/F-divergence ↩↩
-
Nowozin S, Cseke B, Tomioka R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv [stat.ML]. 2016. Available: http://arxiv.org/abs/1606.00709 ↩
ELBO¶
Given a probability distribution density \(p(x)\) and a latent variable \(z\), the marginalization of the joint probability is
$$
p(x) = \int p(x, z) \,\mathrm{d}z.
$$
Using Jensen's Inequality¶
In many models, we are interested in the log probability density \(\log p(x)\), which can be decomposed using an auxiliary density of the latent variable \(q(z)\),
$$
\log p(x) = \log \int p(x, z) \,\mathrm{d}z = \log \int q(z) \frac{p(x, z)}{q(z)} \,\mathrm{d}z.
$$
Jensen's Inequality
Jensen's inequality shows that1
$$
\log \mathbb{E}[X] \geq \mathbb{E}[\log X],
$$
as \(\log\) is a concave function.
Applying Jensen's inequality,
$$
\log p(x) = \log \int q(z) \frac{p(x, z)}{q(z)} \,\mathrm{d}z \geq \int q(z) \log \frac{p(x, z)}{q(z)} \,\mathrm{d}z.
$$
Using the definition of entropy and cross entropy, we know that
$$
-\int q(z) \log q(z) \,\mathrm{d}z
$$
is the entropy of \(q(z)\), and
$$
-\int q(z) \log p(x, z) \,\mathrm{d}z
$$
is the cross entropy. We define
$$
L \equiv \int q(z) \log \frac{p(x, z)}{q(z)} \,\mathrm{d}z,
$$
which is called the evidence lower bound (ELBO). It is a lower bound because
$$
\log p(x) \geq L.
$$
Using KL Divergence¶
In a latent variable model, we need the posterior \(p(z|x)\). When this is intractable, we find an approximation \(q(z|\theta)\), where \(\theta\) is the parametrization, e.g., neural network parameters. To make sure we have a good approximation of the posterior, we require the KL divergence of \(q(z|\theta)\) and \(p(z|x)\) to be small. The KL divergence in this situation is2
$$
\operatorname{D}_{\text{KL}}(q(z|\theta) \parallel p(z|x)) = \mathbb{E}_{q}\left[\log \frac{q(z|\theta)}{p(x, z)}\right] + \log p(x) = \log p(x) - L.
$$
Since \(\operatorname{D}_{\text{KL}}(q(z|\theta)\parallel p(z|x))\geq 0\), we have
$$
\log p(x) \geq L,
$$
which also indicates that \(L\) is the lower bound of \(\log p(x)\).
Jensen gap
The difference between \(\log p(x)\) and \(L\) is the Jensen gap, i.e.,
$$
\log p(x) - L = \operatorname{D}_{\text{KL}}(q(z|\theta) \parallel p(z|x)).
$$
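The relation between the evidence, the ELBO, and the Jensen gap can be verified on a toy discrete latent variable (a NumPy sketch with made-up numbers):

```python
import numpy as np

# A toy discrete latent-variable model: z takes two values, x is one fixed observation.
# The joint probabilities p(x, z) below are made-up numbers for illustration.
p_xz = np.array([0.12, 0.28])   # p(x, z=0), p(x, z=1)
p_x = p_xz.sum()                # evidence p(x)
p_z_given_x = p_xz / p_x        # exact posterior p(z|x)

q = np.array([0.5, 0.5])        # an approximate posterior q(z)

# ELBO: L = E_q[log p(x, z)] - E_q[log q(z)]
elbo = float(np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q)))
# Jensen gap: D_KL(q || p(z|x))
kl = float(np.sum(q * np.log(q / p_z_given_x)))

# log p(x) = L + D_KL(q || p(z|x)); since the KL term is nonnegative, L <= log p(x)
assert np.isclose(np.log(p_x), elbo + kl)
assert elbo <= np.log(p_x)
```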
-
Contributors to Wikimedia projects. Jensen’s inequality. In: Wikipedia [Internet]. 27 Aug 2021 [cited 5 Sep 2021]. Available: https://en.wikipedia.org/wiki/Jensen%27s_inequality ↩
-
Yang X. Understanding the Variational Lower Bound. 14 Apr 2017 [cited 5 Sep 2021]. Available: https://xyang35.github.io/2017/04/14/variational-lower-bound/ ↩
Alignment and Uniformity¶
A good representation should be able to
- separate different instances, and
- cluster similar instances.
Wang et al proposed two concepts that match the above two ideas, alignment and uniformity, on a hypersphere1: alignment requires the embeddings of a positive pair to stay close, while uniformity requires the embeddings to distribute roughly uniformly on the unit hypersphere.

From Wang et al1.
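The two losses can be sketched in a few lines of numpy. The forms below follow Wang et al's \(\mathcal L_\text{align}\) with \(\alpha=2\) and \(\mathcal L_\text{uniform}\) with \(t=2\); the function names are ours.

```python
import numpy as np

def align_loss(x, y, alpha=2):
    """Alignment: mean distance between positive-pair embeddings.

    Both inputs are L2-normalized arrays of shape (n, d)."""
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def uniform_loss(x, t=2):
    """Uniformity: log of the mean Gaussian potential over all pairs."""
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(x), k=1)  # each unordered pair once
    return np.log(np.exp(-t * sq_dists[iu]).mean())

# Four points spread on the unit circle are more "uniform"
# (lower loss) than four collapsed points.
angles = np.linspace(0, 2 * np.pi, 4, endpoint=False)
spread = np.stack([np.cos(angles), np.sin(angles)], axis=1)
collapsed = np.tile([[1.0, 0.0]], (4, 1))
assert uniform_loss(spread) < uniform_loss(collapsed)
assert align_loss(collapsed, collapsed) == 0.0
```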
-
Wang T, Isola P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2005.10242 ↩↩
Gini Impurity¶
Suppose we have a dataset \(\{0,1\}^{10}\), i.e., 10 records, each belonging to one of the 2 possible classes \(\{0,1\}\).
The first example we investigate is a pure 0 dataset.
| object |
|---|
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
For such an all-0 dataset, we would like to define its impurity as 0, and the same for an all-1 dataset. For a dataset with 50% 1s and 50% 0s, we would define its impurity as the maximum, due to the symmetry between 0 and 1.
Definition¶
Given a dataset \(\{0,1,...,d\}^n\), the Gini impurity is calculated as

$$\text{Gini} = \sum_{i=0}^{d} p(i)\left(1 - p(i)\right) = 1 - \sum_{i=0}^{d} p(i)^2,$$

where \(p(i)\) is the probability of a randomly picked record being class \(i\).
In the above example, we have two classes, \(\{0,1\}\). The probabilities are

$$p(0) = 1, \quad p(1) = 0.$$

The Gini impurity is

$$\text{Gini} = 1 - \left(1^2 + 0^2\right) = 0.$$
Examples¶
Suppose we have another dataset with 50% of the values being 1.
| object |
|---|
| 0 |
| 0 |
| 1 |
| 0 |
| 0 |
| 1 |
| 1 |
| 1 |
| 0 |
| 1 |
The Gini impurity is

$$\text{Gini} = 1 - \left(0.5^2 + 0.5^2\right) = 0.5.$$

For data with two possible values \(\{0,1\}\), the maximum Gini impurity is 0.5, reached when \(p(0) = p(1) = 0.5\). The following chart shows all the possible values of the Gini impurity for a two-value dataset.
The following heatmap shows the Gini impurity for data with two possible values. The color indicates the Gini impurity.

For data with three possible values, the Gini impurity is also visualized using the same chart given the condition that \(p_3 = 1 - p_1 - p_2\).
The following chart shows the Gini impurity for data with three possible values. The color indicates the Gini impurity.

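The definition translates directly into code. A minimal sketch (the function name is ours):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum_i p(i)^2 over the class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

assert gini_impurity([0] * 10) == 0.0           # pure dataset
assert gini_impurity([0] * 5 + [1] * 5) == 0.5  # 50/50 binary maximum
```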
Information Gain¶
Information gain is a frequently used metric in calculating the gain during a split in tree-based methods.
First of all, the entropy of a dataset is defined as

$$S = -\sum_i p_i \log_2 p_i,$$

where \(p_i\) is the probability of a class.
The information gain is the change of entropy.
To illustrate this idea, we use the decision tree as an example. In a decision tree algorithm, we split a node. Before the split, we assign a label \(m\) to the node; the entropy is

$$S_m = -\sum_i p_{m,i} \log_2 p_{m,i}.$$

After the split, we have two groups that contribute to the entropy, group \(L\) and group \(R\) 1,

$$S'_m = p_L S_L + p_R S_R,$$

where \(p_L\) and \(p_R\) are the probabilities of the two groups. Suppose we have 100 samples before splitting, with 29 samples in the left group and 71 samples in the right group; then \(p_L = 29/100\) and \(p_R = 71/100\).
The information gain is the difference between \(S_m\) and \(S'_m\),

$$\text{Gain} = S_m - S'_m.$$
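The split evaluation above can be sketched as follows (the helper names and the toy labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """S = -sum_i p_i log2 p_i, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """S_m - (p_L * S_L + p_R * S_R), with weights given by group sizes."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 0], [1, 1, 1, 1]  # a perfect split
assert information_gain(parent, left, right) == 1.0  # one full bit gained
```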
-
Shalev-Shwartz S, Ben-David S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014 doi:10.1017/CBO9781107298019. ↩
Generalization¶
To measure the generalization, we define the generalization error1

$$\mathcal E = \mathcal L_{P}[\hat f] - \mathcal L_{E}[\hat f],$$

where \(\mathcal L_{P}\) is the population loss, \(\mathcal L_E\) is the empirical loss, and \(\hat f\) is the model obtained by minimizing the empirical loss.
However, we do not know the actual joint probability \(p(x, y)\) of our dataset \(\{x_i, y_i\}\), so the population loss is not known. In machine learning, we usually use cross validation: we split our dataset into training and test sets and approximate the population loss using the test set.
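A small simulation illustrates the idea. We draw data from a known DGP, fit a model on the training split, and use the held-out test split as a stand-in for the population loss; all numbers below are hypothetical.

```python
import random

random.seed(0)

# Hypothetical DGP: y = 2x + Gaussian noise; model class: y = w * x.
xs = [random.uniform(-1, 1) for _ in range(2000)]
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in xs]
train, test = data[:1500], data[1500:]

# Minimize the empirical loss (least squares) on the training split.
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mse(split):
    return sum((y - w * x) ** 2 for x, y in split) / len(split)

# The test split approximates the unknown population loss; for this
# well-specified model the generalization error is small.
generalization_error = mse(test) - mse(train)
assert abs(generalization_error) < 0.01
```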
-
Roelofs R. Measuring generalization and overfitting in machine learning. 2019.https://escholarship.org/uc/item/6j01x9mz. ↩
Dynamic Time Warping (DTW)¶
Given two sequences, \(S^{(1)}\) and \(S^{(2)}\), the Dynamic Time Warping (DTW) algorithm finds the best way to align them. During this alignment process, we quantify the misalignment using a distance similar to the Levenshtein distance, where the distance between two series \(S^{(1)}_{1:i}\) (with \(i\) elements) and \(S^{(2)}_{1:j}\) (with \(j\) elements) is3

$$D\left(S^{(1)}_{1:i}, S^{(2)}_{1:j}\right) = d\left(S^{(1)}_i, S^{(2)}_j\right) + \min\left\{ D\left(S^{(1)}_{1:i-1}, S^{(2)}_{1:j}\right), D\left(S^{(1)}_{1:i}, S^{(2)}_{1:j-1}\right), D\left(S^{(1)}_{1:i-1}, S^{(2)}_{1:j-1}\right) \right\},$$

where \(S^{(1)}_i\) is the \(i\)th element of the series \(S^{(1)}\), and \(d(x,y)\) is a predetermined distance, e.g., the Euclidean distance. This definition reveals the recursive nature of the DTW distance.
Notations in the Definition: \(S_{1:i}\) and \(S_{i}\)
The notation \(S_{1:i}\) stands for a series that contains the elements starting from the first to the \(i\)th in series \(S\). For example, given a series

$$S = (s_1, s_2, s_3, s_4, s_5),$$

the notation \(S_{1:4}\) represents

$$S_{1:4} = (s_1, s_2, s_3, s_4).$$

The notation \(S_i\) indicates the \(i\)th element in \(S\). For example, \(S_2 = s_2\).
If we map these two notations to Python,

- \(S_{1:i}\) is equivalent to `S[0:i]`, and
- \(S_i\) is equivalent to `S[i-1]`.

Note that the indices in Python look strange. This is also the reason we choose subscripts instead of square brackets in our definition.
Levenshtein Distance
Given two words, e.g., \(w^{a} = \mathrm{cats}\) and \(w^{b} = \mathrm{katz}\), suppose we can only use three operations: insertion, deletion, and substitution. The Levenshtein distance is the number of such single-character edits needed to change the first word \(w^a\) into the second one \(w^b\). In this example, we need two substitutions, i.e., "c" -> "k" and "s" -> "z".
The Levenshtein distance can be solved using recursive algorithms 1.
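A minimal dynamic-programming sketch of the Levenshtein distance (the function name is ours):

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("cats", "katz") == 2  # "c" -> "k" and "s" -> "z"
assert levenshtein("", "abc") == 3       # three insertions
```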
DTW is very useful when comparing series with different lengths. For example, most error metrics require the actual time series and predicted series to have the same length. In the case of different lengths, we can perform DTW when calculating these metrics2.
Examples¶
The forecasting package darts provides a demo of DTW.
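Alternatively, the recursion can be implemented directly. Below is a minimal bottom-up sketch; the function name and the absolute-difference ground distance are our choices.

```python
import math

def dtw(s1, s2, d=lambda x, y: abs(x - y)):
    """DTW distance via the recursion
    D(i, j) = d(s1_i, s2_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)),
    filled in bottom-up with dynamic programming."""
    n, m = len(s1), len(s2)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(s1[i - 1], s2[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

assert dtw([1, 2, 3], [1, 2, 3]) == 0.0
assert dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]) == 0.0  # stretching in time is free
assert dtw([1, 2, 3], [2, 3, 4]) == 2.0
```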
-
trekhleb. javascript-algorithms/src/algorithms/string/levenshtein-distance at master · trekhleb/javascript-algorithms. In: GitHub [Internet]. [cited 27 Jul 2022]. Available: https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/string/levenshtein-distance ↩
-
Unit8. Metrics — darts documentation. In: Darts [Internet]. [cited 7 Mar 2023]. Available: https://unit8co.github.io/darts/generated_api/darts.metrics.metrics.html?highlight=dtw#darts.metrics.metrics.dtw_metric ↩
-
Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition 2011; 44: 678–693. ↩
DTW Barycenter Averaging¶
DTW Barycenter Averaging (DBA) constructs a series \(\bar{\mathcal S}\) out of a set of series \(\{\mathcal S^{(\alpha)}\}\) so that \(\bar{\mathcal S}\) is the barycenter of \(\{\mathcal S^{(\alpha)}\}\) measured by Dynamic Time Warping (DTW) distance 2.
Barycenter Averaging Based on DTW Distance¶
Petitjean et al proposed a time series averaging algorithm based on DTW distance which is dubbed DTW Barycenter Averaging (DBA).
DBA Implementation
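As a rough sketch (not Petitjean et al's reference implementation), one DBA iteration aligns every series to the current barycenter with DTW and replaces each barycenter point by the mean of the values aligned to it; the helper names and the squared-error cost are our choices.

```python
import math

def dtw_path(s1, s2):
    """DTW alignment path between two sequences (squared-error cost)."""
    n, m = len(s1), len(s2)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s1[i - 1] - s2[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m) to recover the aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]

def dba_iteration(barycenter, series_set):
    """One DBA update: align every series to the current barycenter, then
    replace each barycenter point by the mean of the values aligned to it."""
    buckets = [[] for _ in barycenter]
    for s in series_set:
        for i, j in dtw_path(barycenter, s):
            buckets[i].append(s[j])
    return [sum(b) / len(b) for b in buckets]

# Averaging two identical series recovers the series itself.
series_set = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
assert dba_iteration([0.0, 0.0, 0.0], series_set) == [1.0, 2.0, 3.0]
```

In practice the update is repeated until the barycenter stops changing.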
-
Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition 2011; 44: 678–693. ↩
Ended: Small Yet Powerful Concepts
Other Deep Learning Topics ↵
Contrastive ↵
Contrastive Models¶
Contrastive Models
Learn to compare.
Deep Infomax¶
Max Global Mutual Information
Why not just use the global mutual information of the input and encoder output as the objective?
... maximizing MI between the complete input and the encoder output (i.e., global MI) is often insufficient for learning useful representations.
-- Hjelm et al1
Mutual information maximization is performed on the input of the encoder, \(X\), and the encoded feature, \(\hat X=E_\theta (X)\),

$$\operatorname{argmax}_{\theta} I(X; E_\theta(X)).$$

Being a quantity that is notoriously hard to compute, the mutual information \(I(X;E_\theta (X))\) is usually estimated through a lower bound, which depends on the choice of a functional \(T_\omega\). Thus the objective becomes maximizing a parametrized mutual information estimate,

$$\operatorname{argmax}_{\theta, \omega} \hat I_\omega (X; E_\theta(X)).$$
Local or Global
Two approaches to applying mutual information to encoders:

- Global mutual information between the full input and the full encoding. This is useful for reconstruction of the input.
- Local mutual information between local patches of the input and the full encoding. This is useful for classification.
Local Mutual Information¶
To compare local features to the encoder output, we need to extract values from inside the encoder, i.e., we decompose the encoder as

$$E_\theta = f_{\theta_f} \circ C_{\theta_C}.$$

The first step, \(C_{\theta_C}\), maps the input into feature maps; the second step, \(f_{\theta_f}\), maps the feature maps into the encoding. The feature map is split into patches, \(\left\{ C_{\theta_C}^{(i)} \right\}\). The objective is to maximize the average mutual information between each patch and the encoding,

$$\operatorname{argmax}_{\theta, \omega} \frac{1}{M}\sum_{i=1}^{M} \hat I_\omega \left( C_{\theta_C}^{(i)}(X); E_\theta(X) \right).$$

Why does local mutual information help
Hjelm et al explained the idea behind choosing local mutual information1.
Global mutual information doesn't specify what the meaningful information is: some very local noise can be treated as meaningful information as well.
Local mutual information splits the input into patches and calculates the mutual information between each patch and the encoding. If the model only uses information from a few local patches, the mutual information objective will be small after averaging over all the patches. Thus local mutual information forces the model to use information that is shared globally across the input.
Code¶
- rdevon/DIM: by the authors
- DuaneNielsen/DeepInfomaxPytorch: a clean implementation
-
Devon Hjelm R, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, et al. Learning deep representations by mutual information estimation and maximization. arXiv [stat.ML]. 2018. Available: http://arxiv.org/abs/1808.06670 ↩↩
-
Newell A, Deng J. How Useful is Self-Supervised Pretraining for Visual Tasks? arXiv [cs.CV]. 2020. Available: http://arxiv.org/abs/2003.14323 ↩
Contrastive Predictive Coding¶
Contrastive Predictive Coding, CPC, is an autoregressive model combined with InfoNCE loss1.
Predictive Coding
As a related topic, predictive coding is a learning scheme different from backpropagation: it updates the weights using local updating rules only2.
There are two key ideas in CPC:
- Autoregressive models in latent space, and
- InfoNCE loss that combines mutual information and NCE.
For the series of segments \(\{x_t\}\), we apply an encoder to each segment to calculate the latent representations \(\{{\color{blue}z_t}\}\). The latent representations \(\{{\color{blue}z_t}\}\) are then modeled using an autoregressive model to calculate the context \(\{{\color{red}c_t}\}\).
The loss is built on NCE to estimate a lower bound of the mutual information,

$$\mathcal L = - \mathbb E_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j\in X} f_k(x_j, c_t)} \right],$$

where \(f_k(x_{t+k}, c_t)\) is estimated using a log-bilinear model, \(f_k(x_{t+k}, c_t) = \exp\left( z_{t+k}^T W_k c_t \right)\). This is also a cross entropy loss.
Minimizing \(\mathcal L\) leads to an \(f_k\) that estimates the density ratio1

$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k}|c_t)}{p(x_{t+k})}.$$
We can perform downstream tasks such as classifications using the encoders.
Maximizing this lower bound?
This so-called lower bound of mutual information is not always going to work (Newell and Deng, 2020). In some cases, the representations learned using this lower bound don't help or even worsen the performance of downstream tasks.
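For concreteness, the InfoNCE loss is a categorical cross entropy over the scores \(f_k\), with the positive pair as the correct class. A minimal numpy sketch (the convention of putting the positive pair in column 0 is ours):

```python
import numpy as np

def info_nce(scores):
    """InfoNCE loss: each row holds the scores f_k for one positive
    (column 0, by our convention) and N-1 negatives."""
    logits = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 16))
scores[:, 0] += 10.0  # positive pairs score far above the negatives
assert info_nce(scores) < 0.1  # loss is small when positives dominate
```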
Code¶
-
van den Oord A, Li Y, Vinyals O. Representation learning with Contrastive Predictive Coding. arXiv [cs.LG]. 2018. Available: http://arxiv.org/abs/1807.03748 ↩↩
-
Millidge B, Tschantz A, Buckley CL. Predictive coding approximates backprop along arbitrary computation graphs. 2020.http://arxiv.org/abs/2006.04182. ↩
MADE: Masked Autoencoder for Distribution Estimation¶
MAF: Masked Autoregressive Flow¶
f-GAN¶
The essence of GAN is comparing the generated distribution \(p_G\) and the data distribution \(p_\text{data}\). The vanilla GAN considers the Jensen-Shannon divergence \(\operatorname{D}_\text{JS}(p_\text{data}\Vert p_{G})\). The discriminator \({\color{green}D}\) serves the purpose of forcing this divergence to be small.
Why do we need the discriminator?
If the JS divergence is the objective, why do we need the discriminator? Even in f-GAN we need a functional to approximate the f-divergence; this functional plays the role of the discriminator in GAN.
There exists a more generic form of JS divergence, which is called f-divergence1. f-GAN obtains the model by estimating the f-divergence between the data distribution and the generated distribution2.
Variational Divergence Minimization¶
Variational Divergence Minimization (VDM) extends the variational estimation of f-divergence2. VDM searches for the saddle point of an objective \(F({\color{red}\theta}, {\color{blue}\omega})\), i.e., a minimum w.r.t. \({\color{red}\theta}\) and a maximum w.r.t. \({\color{blue}\omega}\), where \({\color{red}\theta}\) is the parameter set of the generator \({\color{red}Q_\theta}\), and \({\color{blue}\omega}\) is the parameter set of the variational approximation used to estimate the f-divergence, \({\color{blue}T_\omega}\).
The objective \(F({\color{red}\theta}, {\color{blue}\omega})\) is determined by the choice of \(f\) in the f-divergence and the variational functional \({\color{blue}T}\),

$$F({\color{red}\theta}, {\color{blue}\omega}) = \mathbb E_{x\sim p_\text{data}} \left[ {\color{blue}T_\omega}(x) \right] - \mathbb E_{x\sim {\color{red}Q_\theta}} \left[ f^*\left({\color{blue}T_\omega}(x)\right) \right].$$

In the above objective,

- \(f^*\) is the Legendre–Fenchel transformation of \(f\), i.e., \(f^*(t) = \operatorname{sup}_{u\in \mathrm{dom}_f}\left\{ ut - f(u) \right\}\)3.
\(T\)
The function \(T\) is used to estimate the lower bound of f-divergence2.
We estimate
- \(\mathbb E_{x\sim p_\text{data}}\) by sampling from the mini-batch, and
- \(\mathbb E_{x\sim {\color{red}Q_\theta} }\) by sampling from the generator.
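The estimation can be sketched with samples. For the KL divergence, \(f(u) = u\log u\) and \(f^*(t) = e^{t-1}\); below we plug in the known optimal \(T\) for two unit Gaussians (instead of a learned \(T_\omega\)) and recover the true divergence by Monte Carlo. The setup and names are our own illustration, not part of the f-GAN paper.

```python
import math
import random

random.seed(0)

# p = N(0, 1), q = N(1, 1); the true KL divergence is (mu_p - mu_q)^2 / 2 = 0.5.
mu_p, mu_q = 0.0, 1.0

def log_ratio(x):
    """log p(x) - log q(x) for the two unit-variance Gaussians."""
    return (-(x - mu_p) ** 2 + (x - mu_q) ** 2) / 2

# For f(u) = u log u: conjugate f*(t) = exp(t - 1),
# and the optimal variational function is T(x) = 1 + log p(x)/q(x).
T = lambda x: 1 + log_ratio(x)
f_star = lambda t: math.exp(t - 1)

n = 200_000
xp = [random.gauss(mu_p, 1) for _ in range(n)]  # samples from p_data
xq = [random.gauss(mu_q, 1) for _ in range(n)]  # samples from the "generator"
lower_bound = (sum(T(x) for x in xp) / n
               - sum(f_star(T(x)) for x in xq) / n)
assert abs(lower_bound - 0.5) < 0.05  # matches the true KL divergence
```

With a parametrized \({\color{blue}T_\omega}\), maximizing this quantity over \({\color{blue}\omega}\) approaches the divergence from below.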
Reduce to GAN
The VDM loss can be reduced to the loss of GAN by setting2

$$f^{*}(t) = -\log\left(1 - e^{t}\right), \quad {\color{blue}T_\omega}(x) = \log {\color{green}D_\omega}(x).$$

It is straightforward to validate that these choices give \(F({\color{red}\theta}, {\color{blue}\omega}) = \mathbb E_{x\sim p_\text{data}}\left[\log {\color{green}D_\omega}(x)\right] + \mathbb E_{x\sim {\color{red}Q_\theta}}\left[\log\left(1 - {\color{green}D_\omega}(x)\right)\right]\), which is the GAN objective.
Code¶
-
Contributors to Wikimedia projects. F-divergence. In: Wikipedia [Internet]. 17 Jul 2021 [cited 6 Sep 2021]. Available: https://en.wikipedia.org/wiki/F-divergence#Instances_of_f-divergences ↩
-
Nowozin S, Cseke B, Tomioka R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv [stat.ML]. 2016. Available: http://arxiv.org/abs/1606.00709 ↩↩↩↩↩
-
Contributors to Wikimedia projects. Convex conjugate. In: Wikipedia [Internet]. 20 Feb 2021 [cited 7 Sep 2021]. Available: https://en.wikipedia.org/wiki/Convex_conjugate ↩
InfoGAN¶
In GAN, the latent space input is usually random noise, e.g., Gaussian noise. The objective of GAN is a very generic one: it doesn't say anything about how exactly the latent space will be used. This is not desirable in many problems where we would like more interpretability in the latent space. InfoGAN introduces constraints in the objective to enforce the interpretability of the latent space1.
Constraint¶
The constraint InfoGAN proposed is based on mutual information,

$$\min_{\color{red}G} \max_{\color{green}D} V_I({\color{green}D}, {\color{red}G}) = V({\color{green}D}, {\color{red}G}) - \lambda I(c; {\color{red}G}(z,c)),$$

where
- \(c\) is the latent code,
- \(z\) is the random noise input,
- \(V({\color{green}D}, {\color{red}G})\) is the objective of GAN,
- \(I(c; {\color{red}G}(z,c))\) is the mutual information between the input latent code and generated data.
With the multiplier \(\lambda\), we penalize the model if the generator loses the information in the latent code \(c\).
Training¶

The training steps are almost the same as GAN but with one extra loss to be calculated in each mini-batch.
- Train \(\color{red}G\) using loss: \(\operatorname{MSE}(v', v)\);
- Train \(\color{green}D\) using loss: \(\operatorname{MSE}(v', v)\);
- Apply Constraint:
- Sample data from mini-batch;
- Calculate loss \(\lambda_{l} H(l';l)+\lambda_c \operatorname{MSE}(c,c')\)
Code¶
-
Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1606.03657 ↩
-
Agakov DBF. The im algorithm: a variational approach to information maximization. Adv Neural Inf Process Syst. 2004. Available: https://books.google.com/books?hl=en&lr=&id=0F-9C7K8fQ8C&oi=fnd&pg=PA201&dq=Algorithm+variational+approach+Information+Maximization+Barber+Agakov&ots=TJGrkVS610&sig=yTKM2ZdcZQBTY4e5Vqk42ayUDxo ↩
Ended: Contrastive
Ended: Other Deep Learning Topics
Ended: Supplementary
About ↵
Roadmap¶
When I switched to data science, I built my digital garden, datumorphism. I deliberately designed this digital garden as my second brain. As a result, most of the articles are fragments of knowledge and require context to understand them.
Making bricks is easy, but assembling them into a house is not. So I have decided to use this repository to practice my house-building techniques.
I do not have a finished blueprint yet. But I have a framework in my mind: I want to consolidate some of my thoughts and learnings in an organized way. However, I do not want to compile a reference book, as datumorphism already serves this purpose. I am thinking of an
Open Source¶
This is an open-source project on GitHub: emptymalei/deep-learning.
How do I Write It¶
I am trying out a more "agile" method: instead of finishing the whole project at once, I will release the book chapter by chapter. A few thoughts on this plan:
- Each new section should be a PR.
- Release on every new section.
How do I track the Progress¶
I use GitHub Projects. Here is my board.
